Abstract

Numerous signal processing systems in the Department of Defense and industry would benefit
from a microprocessor tailored to their specific applications. This thesis effort describes the
architecture and design for a 64 bit application specific processor (ASP) which combines the power
of double precision floating point hardware with the flexibility of a laser programmable microcode
store. This floating point ASP (FPASP) contains a variety of circuits which efficiently perform
digital signal processing and other applications requiring double precision floating point arithmetic.

This thesis describes a new rapid prototyping methodology for ASPs. The user provides
an algorithm which is translated into microcode, tested on a software model, and then cut into the
laser PROM of a blank FPASP chip. With this methodology the prototyping of an ASP can be
completed in a matter of days, and no hardware design is involved. The programmed FPASP can
then be mounted on a circuit board and placed in a host processor to act as a hardware accelerator
for computationally intense programs. The FPASP also supports a macro assembly language which
can be partially user-defined, so the FPASP can be tailored to higher level applications such as
operating system support.

The FPASP has been designed to support common software structures. It contains registers
for loop indexing and for addressing into matrices. The FPASP also contains a subroutine stack 16
words deep, which can be extended into the external memory for an additional 1024 words to
support recursive microcode routines. The FPASP will contain 180,000 CMOS transistors on a
chip 0.35 inches on a side. It is designed to operate at 25 MHz, and at that speed it will be capable
of performing 25 million floating point operations a second.


The FPASP architecture consists of two 32 bit processors which can operate independently for
integer operations, or in tandem for double precision floating point operations. Overlapping register
sets on each datapath and ties between the two datapaths provide a high degree of interconnectivity,
allowing efficient internal data transfer between the 66 32 bit registers and the processing circuits.


Architecture  and  Design  for  a  Laser  Programmable 
Double  Precision  Floating  Point 
Application  Specific  Processor 


I.  Introduction 


1.1  Background 

Continuing  advances  in  integrated  circuit  fabrication  make  possible  ever  smaller  feature  sizes, 
allowing  more  circuitry  to  be  placed  on  a  chip.  The  hardware  required  to  perform  double  precision 
floating  point  arithmetic,  which  is  often  packaged  as  a  separate  “coprocessor,”  can  now  be  placed 
on  the  same  chip  as  the  basic  processing  hardware,  and  the  chip  can  be  kept  down  to  a  size  that 
allows  reasonable  yields. 

Recent  research  in  laser  technology  at  the  Massachusetts  Institute  of  Technology  Lincoln 
Laboratory has shown that circuits can be designed and fabricated using standard processes and
then  altered  after  packaging.  A  laser  programmable  read-only  memory  (LPROM)  can  contain  the 
microcode  program  used  in  an  application  specific  processor  (ASP).  The  design  of  an  LPROM 
based  on  this  technique  is  the  topic  of  a  concurrent  thesis  by  Capt.  Tillie  [Til88].  The  hardware 
needed  to  program  the  LPROM  and  the  software  needed  to  automate  the  process  have  also  been 
developed  as  part  of  that  effort. 

Research  projects  at  AFIT  have  produced  a  single  precision  floating  point  multiplier  and  a 
library  of  sub-circuits  for  a  bit-slice  application  specific  processor  [Gai87].  These  circuits  were  part 
of  a  methodology  for  producing  prototypes  of  application  specific  processors  by  assembling  cells 
from  a  cell  library,  then  fabricating  the  circuit.  Their  purpose  was  to  decrease  the  time  required 
for  designing  these  types  of  circuits.  A  test  processor  based  on  this  methodology  was  designed 
and  fabricated  in  3  micron  CMOS.  The  single  precision  floating  point  multiplier  has  also  been 
fabricated. 

AFIT  is  currently  researching  digital  signal  processing  for  space  surveillance  applications. 
One  of  the  algorithms  used  for  digital  signal  processing  is  Kalman  filtering.  This  algorithm  is 
computationally  intensive.  It  would  be  more  efficient  to  have  a  dedicated  piece  of  hardware  to 
carry  out  the  computations  rather  than  use  a  general  purpose  computer.  This  makes  the  algorithm 
a good choice for implementation with an ASP. To obtain the most accurate results, a double
precision data format has been chosen. Hardware for double precision floating point arithmetic is
currently being designed.

These projects set the stage for this thesis, which will bring them together with the technological
advances mentioned in the design of a laser programmable processor to meet the needs of
the Kalman filter and similar computational problems.

1.2 Problem Statement

This  research  will  create  a  processor  architecture  and  VLSI  design  which  will  support  double 
precision  floating  point  arithmetic  and  will  be  micro-programmable  after  fabrication  for  specific 
applications. 

1.3  Objectives 

The  objective  of  this  thesis  is  to  take  advantage  of  the  new  1.2  micron  CMOS  technology  to 
show  that  a  double  precision  floating  point  ASP  can  be  put  on  a  single  chip  and  contain  enough 
processing hardware to be useful for a wide range of applications. In addition, laser programming
technology will be employed to allow the ASP to be tailored to the specific application after
fabrication. An architecture will be developed for a laser programmable, double precision floating point
application  specific  processor  (FPASP).  The  FPASP  will  form  the  basis  of  a  new  rapid  prototyping 
methodology  wherein  the  “fabricate  a  new  chip  for  each  application”  paradigm  previously  applied 
to  ASPs  will  be  replaced  with  “fabricate  a  generic  ASP  in  quantity  then  program  copies  of  it  as 
needed”. 

The  FPASP  will  support  the  software  structures  used  in  matrix  algebra  microcode.  These 
structures  include  nested  loops,  recursion,  and  data  stored  in  row-major  order.  Microcode  written 
for the FPASP will demonstrate the usefulness of the architecture for handling a variety of
applications which use these structures. The FPASP will support an assembly language which can be
partially  user-defined. 

The  FPASP  must  also  be  able  to  perform  a  Kalman  filter  routine  without  laser  programming, 
in  case  the  laser  hardware  is  not  ready  when  the  first  application  must  be  done. 

1.4 Scope

1.4.1 Scope of the Thesis Report. This report will detail the architecture and design
of the FPASP, exclusive of the floating point hardware and the laser programmable memory
[May88,FVe88]. These circuits are the subject of other projects concurrent with this thesis effort,
and so will only be briefly discussed in this report.

Microcode  written  as  part  of  this  effort  and  as  a  class  project  for  the  EE588  class  from  the 
summer  quarter  1988  will  be  used  to  demonstrate  how  the  hardware  supports  software  structures. 
In  addition  to  this  microcode,  the  microcode  and  hardware  which  supports  an  assembly  language 
will  also  be  discussed. 

Models  of  the  FPASP  used  for  verifying  the  design  of  the  subcells  will  be  discussed.  A  high- 
level  structural  model  written  in  the  VHSIC  Hardware  Description  Language  (VHDL)  will  be  listed. 
The  VHDL  model  will  become  part  of  the  new  methodology  for  the  rapid  prototyping  of  ASPs. 
The  methods  used  to  verify  the  proper  operation  of  the  subcircuits  will  also  be  covered. 

1.4.2  Scope  of  the  FPASP  Architecture.  The  scope  of  problems  for  which  the  FPASP 
was  designed  is  the  general  class  of  problems  which  manipulate  matrix  type  databases  stored  in 
row-major  order.  A  library  of  code  to  be  stored  permanently  on  the  FPASP  will  contain  subroutines 
which  perform  basic  matrix  algebra  operations  such  as  dot  product  and  matrix-matrix  multiply. 

The  FPASP  most  efficiently  supports  two  data  representations;  32-bit  integers  and  64-bit 
IEEE  double  precision  floating  point  numbers.  The  built-in  floating  point  hardware  will  include  a 
multiplier  and  an  adder/subtractor.  The  FPASP  is  designed  to  directly  support  nested  loops  to  a 
depth  of  six,  and  recursive  subroutine  calls  to  a  depth  of  1040. 

Three  features  extend  the  scope  of  the  FPASP  concept  beyond  the  original  ASP  concept. 
One  is  the  ability  to  write  microcode  into  the  FPASP’s  laser  programmable  ROM  (LPROM)  after 
it  is  fabricated,  rather  than  fabricating  an  entirely  new  chip  each  time  an  ASP  is  needed.  Another 
is  an  assembly  language  which  is  partially  coded  into  the  FPASP’s  fixed  ROM  library  and  partly 
user-definable  via  the  LPROM  and  a  laser  programmable  opcode  map.  The  third  is  a  subroutine 
stack that extends into the external memory. These features increase the scope of problems the
FPASP  can  handle. 

1.5  Summary  of  Current  Knowledge 

Research  on  the  design  of  integer  ASPs  supplied  some  of  the  macrocells  used  in  the  FPASP. 
The  detailed  operation  of  these  cells  has  been  determined  by  reading  Capt.  Gallagher’s  thesis, 
looking at the cells’ Magic descriptions, and by reverse engineering their logical structure from SIM
files. The SIM extractor was written by CPT Dukes and tested on some of the cells from the ASP
library [Duk88]. An integer ASP test chip which uses these cells has been fabricated by MOSIS
and is available for testing.


Microcode  routines  were  written  by  Capt.  Linderman  for  a  single  precision  floating  point 
ASP based on Capt. Gallagher’s integer ASP architecture. These routines were available for
conversion to FPASP microcode. In addition to these, routines written as EE588 class projects
were available to point out the strengths and weaknesses of the FPASP as its design evolved from
Capt. Linderman’s early design.

A large number of cells also exist for the Winograd Fourier Transform (WFT) processor chips
[She86]. These chips not only provided pad cells for the FPASP, but also provided design and
layout  experience  during  a  class  project  which  involved  creating  the  WFT17  chip  from  the  similar 
but  smaller  WFT16  chip. 

Another  chip  laid  out  for  a  previous  thesis  effort  was  a  laser  programmable  comparator 
[Spa86].  That  chip  was  reconfigured  to  be  used  on  a  proposed  ASP  board  for  the  Sun  Microsystems 
workstations  [Jav87].  That  chip  has  been  fabricated  as  a  MOSIS  TinyChip  [MOS88].  Work  on  that 
chip  provided  experience  as  a  background  to  this  thesis  effort. 

1.6  Assumptions 

It  is  assumed  that  the  floating  point  hardware  projects  will  be  completed  in  time  to  integrate 
those  cells  into  the  FPASP  as  a  part  of  this  thesis  effort.  Floorplanning  of  the  FPASP  indicated 
how  much  space  is  available  for  those  cells.  These  estimates  were  based  on  the  completed  core 
adder  array  of  the  floating  point  multiplier.  The  interface  for  the  floating  point  hardware  is  fixed, 
so  if  they  are  not  done  in  time  for  integration  as  part  of  this  effort,  it  will  take  only  a  few  days’ 
work  to  complete  their  integration  with  the  rest  of  the  FPASP. 

Another  assumption  is  that  the  equipment  for  the  laser  programming  will  be  in  place  and 
operating  when  it  comes  time  to  program  the  FPASP.  It  was  assumed  from  the  start  that  laser 
programming  would  work,  based  on  reports  from  Lincoln  Laboratories.  It  is  also  assumed  that 
the  programming  speed  of  the  laser  hardware  will  be  increased  before  any  actual  programming 
of  FPASPs  in  quantity  will  take  place.  Presently,  speeds  proposed  by  Capt.  Tillie  indicate  a 
programming  time  for  256  64-bit  words  of  laser  PROM  of  12  hours. 

It is also assumed that the memory chips and printed circuit board to be used with the
FPASP  will  be  capable  of  supporting  a  single  cycle  read  or  write. 

1.7 Approach

1.7.1 Approach to Architectural Specification. Figure 1.1 shows a hierarchy of design
which will be referred to throughout this paper. The size of the text indicates the areas of emphasis
for this thesis. The formulation of the FPASP architecture followed neither a top-down approach
from software, nor a bottom-up approach from hardware. The top-down approach cannot take into
account the problems encountered in the design of hardware, which could result in an architecture
that cannot be realized in silicon. On the other hand, the bottom-up approach from hardware
cannot efficiently support a software routine which has not been written yet, which means the
programmer would have to use whatever the designer decided was good enough or what would fit
on the chip.


Figure  1.1.  Theme  Figure. 

The approach chosen for the FPASP was an iterative one; microcode was written at various
stages of architectural development. This provided feedback, pointing out hardware that was needed
but not available, or hardware that was available but was superfluous or not used often enough to
justify the area it occupied. At the same time, floorplanning of the FPASP was carried out. This
provided size estimates of the hardware needed to support the microcode. Thus the development of
the FPASP architecture proceeded with feedback coming from both sides, which guided it towards
an efficient compromise between what would be nice to have for the software and what could be
fit onto a chip with reasonable yields.

Throughout  the  evolution  of  the  architecture  and  development  of  the  macrocells,  design  for 
testability  was  included.  The  approach  to  design  for  testability  followed  closely  the  approach  taken 
for  the  control  section  of  the  WFT17  chip.  Emphasis  was  placed  on  making  the  microcode  sequencer 
control  section  more  observable  and  controllable. 

1.7.2 Approach to Chip Design. Before actual cell design was attempted, a floorplan
was  generated  to  get  an  idea  of  size  and  placement  of  the  cells.  The  sizes  of  cells  were  based  on 
existing  ASP  and  LPROM  cells,  and  on  the  completed  portion  of  the  floating  point  multiplier. 

After  floorplanning,  a  search  was  made  of  the  existing  cell  libraries  to  see  which  cells  could 
be  reused  directly  or  with  slight  modification.  Whenever  possible,  cells  which  had  to  be  designed 
for  the  FPASP  were  based  on  existing  cells  or  on  a  set  of  standard  parts  gleaned  from  those  cells. 

Early  on  it  was  obvious  that  the  FPASP  would  need  more  control  bits  than  could  be  efficiently 
held  in  microcode  memory;  so  emphasis  was  placed  on  choosing  microcode  fields  and  encoding 
schemes  which  could  be  decoded  with  a  minimum  of  extra  circuitry.  This  affected  the  design  of  the 
cells  and  the  ordering  of  the  encoded  choices  in  each  microcode  field. 


1.8  Materials  and  Equipment 

This  thesis  effort  made  use  of  the  Berkeley  Unix  CAD  tool  set  [Cal86].  This  included  the 
Magic layout tool, the Mextra transistor netlist extraction tool, and the Esim switch level
simulator [Ter86]. Several CAD tools written at AFIT were also used. These included the CSTAT
static design checker [Lin85], the GMAT microcode assembler [Hau87], and the circuit extractor
mentioned above. Spice 3 was used for analog simulation of transistors based on models received
from the fabricator on previous 1.2 micron CMOS fabrication runs [Qua86]. IEEE 1076 standard
VHDL was used for the model [Fle88].

Hardware tools used included the ELXSI superminicomputers (BSD and ICC), the Vax 11/785
computers (SSC and CSC), a Sun 3/160 workstation (Mercury), an AED767 graphics display with
a Summagraphics digitizing pad, and laser printers. All of the machines listed above were available
at AFIT. I also used my home system, a Heathkit H100/IBM PC compatible with a 1200 baud
modem.

The  MOS  Implementation  Service  (MOSIS)  will  be  used  for  the  fabrication  of  the  FPASP, 
with  funding  from  the  Defense  Advanced  Research  Projects  Agency  (DARPA). 

1.9  Order  of  Presentation 

The  following  chapters  cover  different  aspects  of  the  research.  For  each  chapter  a  version  of 
Figure  1.1  is  used  to  show  which  areas  will  be  concentrated  on.  In  these  figures  the  inner  brace 
shows the main area, and the outer brace shows areas which are also affected or which generated
feedback. 

Chapter  2  presents  an  analysis  of  the  problems  and  requirements  that  affected  the  FPASP 
architecture  and  VLSI  design.  Chapters  3  through  5  present  the  research  and  how  it  was  carried 
out.  Chapter  3  covers  the  details  of  the  FPASP  architecture  and  how  feedback  from  hardware  and 
software  research  affected  it. 

Chapter  4  examines  the  hardware  designed  to  implement  the  FPASP  architecture.  Block 
diagrams  are  used  to  show  how  the  various  macrocells  work  together  to  embody  the  architecture 
in  a  real  chip  layout.  Microcode  routines  are  presented  to  show  how  these  macrocells  are  designed 
to  support  software  structures. 

Chapter 5 details the low-level VLSI design and modeling of the FPASP. This includes
verification of the design at both the transistor and macro-cell level.

Chapter  6  discusses  the  results  of  the  FPASP  project.  The  efficiency  and  variety  of  the 
microcode  written  for  the  FPASP  will  demonstrate  its  usefulness.  The  projected  performance  of 
the  FPASP  will  be  compared  with  the  performance  of  some  other  machines  to  give  an  idea  of  what 
the  computing  power  of  the  FPASP  will  be. 

Finally,  Chapter  7  presents  some  conclusions  about  the  overall  FPASP  thesis  effort,  and  lists 
some  suggestions  for  future  projects  related  to  the  FPASP. 

II. Problem Analysis


2.1  Introduction 

This  chapter  discusses  the  design  criteria  that  the  FPASP  architecture  had  to  support,  and 
the  technological  constraints  which  affected  the  design  of  the  FPASP  chip.  The  solutions  to  these 
problems are the subject of Chapters 3 through 5.


2.2  Efficient  Dot  Product  Routine 

The  first  application  for  the  FPASP  chip  will  be  to  perform  digital  signal  processing  in  the 
form  of  Kalman  filtering,  on  a  VMEbus-compatible  board  in  a  Sun  workstation.  Therefore,  the 
FPASP  must  efficiently  support  matrix  algebra,  which  is  used  extensively  in  the  Kalman  filtering 
algorithm. 

A basic operation in matrix algebra is the dot product of two vectors. An example of a dot
product is shown below.

$[2\ 4\ 6] \cdot [1\ 3\ 5]^T = 2 \cdot 1 + 4 \cdot 3 + 6 \cdot 5 = 44$

Corresponding elements of the two vectors are multiplied together and accumulated into a final
result which is a scalar. This routine is at the core of a larger routine for multiplying two matrices
together, as seen in the flowchart in Figure 2.1.

The matrix multiply routine is a double-nested loop, with a dot product in the inner loop.
This means that for square matrices with $n$ vectors, the dot product routine must be done $n^2$
times, with each dot product performing $n$ multiplies and $n-1$ accumulates; thus $n^3$ multiplies and
$n^2(n-1)$ accumulates must be done for each matrix-matrix multiplication. The matrix multiply
is a basic matrix algebra operation; therefore, the speed of the dot product routine will determine
the overall speed, within a constant factor, of any algorithm which depends on it.
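
The operation counts above can be checked with a short software sketch. The Python below is an illustration only and is not FPASP microcode; the function names and the `counts` dictionary are assumptions introduced for this example.

```python
def dot_product(row, col, counts):
    """Multiply-accumulate two equal-length vectors, tallying operations."""
    acc = row[0] * col[0]          # first product starts the running sum
    counts["multiplies"] += 1
    for a, b in zip(row[1:], col[1:]):
        acc += a * b               # one multiply and one accumulate per remaining element
        counts["multiplies"] += 1
        counts["accumulates"] += 1
    return acc


def matrix_multiply(a, b):
    """Multiply two n x n matrices using one dot product per result element."""
    n = len(a)
    counts = {"multiplies": 0, "accumulates": 0}
    # Gather the columns of b once so each dot product sees two plain vectors.
    columns = [[b[k][j] for k in range(n)] for j in range(n)]
    product = [[dot_product(a[i], columns[j], counts) for j in range(n)]
               for i in range(n)]
    return product, counts
```

For two 3 x 3 matrices this tallies 27 multiplies and 18 accumulates, which agrees with $n^3$ and $n^2(n-1)$ for $n = 3$.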

An  efficient  dot  product  routine  was  one  of  the  basic  problems  to  be  solved  by  the  FPASP 
architecture.  It  had  to  provide  the  datapaths  and  functionality  to  perform  these  two  operations, 
multiply  and  accumulate,  on  double  precision  floating  point  numbers,  including  moving  the  data 
on  and  off  the  chip,  in  as  few  clock  cycles  as  possible. 

A processor architecture with single precision floating point hardware was designed by Capt.
Linderman for a class project in microcode design. A register-level description of this processor
appears in Figure 2.2. The winter quarter 1988 EE588 class project was to write Kalman Filter
microcode for this processor.

Figure 2.1. Matrix-Matrix Multiply Flow. (Flowchart not reproduced; its first steps are: the user
routine calls the matrix-matrix multiply and passes pointers to the ‘A’ and ‘B’ matrices.)

One important feature of this architecture was the specification that the floating point hardware
would have two clock cycles to compute. The microcode must contend with this pipelined data
flow, but the processor gains efficiency by performing more operations in each clock cycle. The
length of the clock cycle can be decreased by giving the floating point processors two clock cycles
instead of one. This allows two integer operations to be performed during each floating point
operation, since all of the integer operations can be completed in the shorter cycle time.

So, while a floating point result is being computed, the rest of the processor can be doing
other things. For example, the floating point adder can be loaded in one cycle, and the multiplier
in the next; the third cycle can then be used to unload the adder and the fourth cycle to unload
the multiplier. This back and forth loading and unloading is used by the dot product routine to
keep both floating point processors in continuous operation, while at the same time computing the
addresses for the operands.
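
The alternating load and unload pattern described above can be pictured with a small scheduling sketch. This is a simplified model, not the actual microcode: it assumes that the adder is loaded on odd cycles and the multiplier on even cycles, that a unit can be unloaded and reloaded in the same cycle, and the function name `schedule` is hypothetical.

```python
LATENCY = 2  # clock cycles allowed for a floating point result (from the text)


def schedule(cycles):
    """Return, for each clock cycle, the floating point load/unload events."""
    events = {c: [] for c in range(1, cycles + 1)}
    for c in range(1, cycles + 1):
        # Assumption: the adder is loaded on odd cycles, the multiplier on even ones.
        unit = "adder" if c % 2 == 1 else "multiplier"
        events[c].append(f"load {unit}")
        done = c + LATENCY
        if done <= cycles:
            # The result becomes available two cycles after the load.
            events[done].append(f"unload {unit}")
    return events


for cycle, actions in schedule(4).items():
    print(cycle, actions)
```

In this model each unit is reloaded in the same cycle in which its previous result is unloaded, so neither unit sits idle once the pattern is established.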

2.2.1 Pipelined Architecture. A pipelined structure such as the two-cycle computation
time for the floating point hardware is common in processor design where one or more computations
take much longer than basic processing functions such as calculating data addresses. In the case
of the SPASP, the delay is worked out by the microcode, but this problem also occurs in
non-microcode-driven architectures. In those cases, pipelining is used when there are more processing
circuits than there are pins available to supply data to each one simultaneously.

An example of a processor pipelined for the second reason is shown in Figure 2.3 [Gun88].
The IQMAC chip performs single precision floating point arithmetic for digital signal processing.
It uses a hardware pipeline to increase the number of operations being performed on each clock
cycle. The chip uses 1.2 µm CMOS technology.

With  that  much  hardware  packed  onto  a  chip,  the  number  of  separate  processing  circuits  has 
outstripped the packaging technology. In this case, the package is a 256 lead flatpack, but there
are five separate 32 bit arithmetic processors. The hardware pipeline allows the data to be passed
back  and  forth  through  fewer  pins:  the  processors  are  arranged  one  after  the  other,  so  only  the  first 
stage  of  the  pipeline  needs  to  be  loaded,  and  only  the  last  stage  needs  to  drive  out  data. 

Pipelining has also been used in the SPASP architecture, as seen in Figure 2.2. In the SPASP
input data is placed on the A and B busses. Normally the result would be driven onto the C bus
in  the  following  clock  cycle  and  loaded  into  a  register.  In  the  dot  product  routine,  a  multiply  and 
addition  must  be  done  in  each  iteration  of  the  loop.  This  is  where  a  change  in  the  architecture 
can  decrease  the  length  of  the  software  ‘pipeline.’  If  the  multiplier  results  can  be  driven  directly 
onto  the  A  or  B  bus,  the  dot  product  accumulated  so  far  can  be  driven  onto  the  other  bus  and  the 

2.3 Number Representations

The  SPASP  circuit  proposed  for  Kalman  filtering  for  the  EE588  class  was  intended  to  be 
fabricated  as  an  ASP  based  on  the  cell  library  created  by  Capt.  Gallagher.  That  library  only 
contained a single precision floating point multiplier.

A  class  project  by  Capt.  Fretheim  has  indicated  that  a  double  precision  multiplier  can  be 
made as a macrocell for an ASP using the 1.2 micron CMOS technology. This will allow the more
precise  multiplier  to  fit  on  a  chip  with  enough  room  left  over  for  the  processing  circuitry.  The  IEEE 
double  precision  floating  point  number  representation  is  shown  in  Figure  2.4.  The  more  precise 
data  representation  requires  the  ASP  to  support  a  data  width  of  64  bits,  which  is  larger  than  the 
cells  in  the  library  were  sized  to  support. 


Figure  2.4.  IEEE  Double  Precision  Floating  Point  Format. 
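
As an illustration of the 64-bit format, the sketch below splits a double precision value into its sign, exponent, and fraction fields. The field widths (1, 11, and 52 bits) come from the IEEE standard itself; the function name `decode_double` is an assumption made for this example and does not describe any FPASP circuit.

```python
import struct


def decode_double(value):
    """Split an IEEE double precision number into (sign, exponent, fraction)."""
    # Reinterpret the 8 bytes of the double as one 64-bit unsigned integer.
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    sign = bits >> 63                   # 1 bit
    exponent = (bits >> 52) & 0x7FF     # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)   # 52 bits, leading 1 is implied
    return sign, exponent, fraction


print(decode_double(1.0))
print(decode_double(-2.0))
```

For 1.0 the fields are (0, 1023, 0), and for -2.0 they are (1, 1024, 0), since the exponent is stored with a bias of 1023.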

2-4  Support  for  Software  Structures 

The need to support the most efficient double precision dot product routine was one of the
main  problems  to  be  solved  by  the  FPASP  architecture,  but  it  was  not  the  only  one.  Since  the 
FPASP  is  intended  to  also  support  many  as  yet  unknown  algorithms,  it  must  contain  enough 
features  to  allow  it  to  perform  a  wide  range  of  functions.  Two  of  these  projected  requirements 
are  support  for  the  software  structures  of  nested  loops  and  recursive  routines.  These  additional 
features  must  not  detract  from  the  efficiency  of  the  Kalman  filtering  routine. 

Matrix data is commonly stored in row-major order: successive memory locations hold the
elements of the array starting with the first element of the first row and continuing row by row to
the last element of the last row. In matrix algebra routines, the data is often accessed column by
column as well as row by row. In order to efficiently support this addressing mode, the register
which holds the pointer to the data must be able to be incremented by the number of elements in
a row. Since the FPASP must support generalized matrix algebra routines, this increment value
must be easily varied.
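
The column-by-column addressing described above can be sketched as follows. The example assumes one address per matrix element and uses a hypothetical helper, `column_addresses`, to show a pointer that is stepped by the row length rather than by one.

```python
def column_addresses(base, rows, cols, column):
    """Addresses of one column of a rows x cols matrix stored in row-major order."""
    pointer = base + column     # first element of the column sits in the first row
    addresses = []
    for _ in range(rows):
        addresses.append(pointer)
        pointer += cols         # step by a full row to reach the next element of the column
    return addresses


# A 3 x 4 matrix stored row by row starting at address 0.
memory = [11, 12, 13, 14,
          21, 22, 23, 24,
          31, 32, 33, 34]
print([memory[a] for a in column_addresses(0, 3, 4, 1)])
```

Because the increment is a parameter (`cols`), the same loop serves matrices of any width, which is the reason the increment value must be easily varied.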


In some microcode controlled processors such as the ASP, calls to a subroutine cause the
return address to be saved onto a register stack. The depth of this stack determines how many
calls can be made before the machine can no longer support that software structure. Some software
programs require a routine to keep calling itself until a condition is met. Such recursive routines
require a return address to be put on the stack for each such call, especially if the recursively called
routine calls on other routines. A feature of recursive software code is that the number of calls is
unknown to the calling routine. The cells designed for the ASP cell library limit the depth of the
stack to the amount of space available on the chip for stack registers. For long recursion chains, it
would be impractical to build a stack deep enough. The FPASP must therefore be able to save
the stack externally when it gets full.
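
A software model can illustrate the idea of a return-address stack that spills into external memory. The sketch below is an assumption-laden illustration, not the FPASP circuit: the depths of 16 on-chip words and 1040 words in total follow the figures quoted in this report, while the spill policy (moving the oldest on-chip entry out one word at a time) and the class name `SubroutineStack` are hypothetical.

```python
class SubroutineStack:
    ON_CHIP_DEPTH = 16       # on-chip stack registers
    EXTERNAL_DEPTH = 1024    # additional words in external memory (1040 in total)

    def __init__(self):
        self.on_chip = []    # newest return addresses, most recent last
        self.external = []   # older return addresses spilled off the chip

    def push(self, return_address):
        if len(self.on_chip) == self.ON_CHIP_DEPTH:
            if len(self.external) == self.EXTERNAL_DEPTH:
                raise OverflowError("subroutine stack exhausted")
            # Assumed policy: spill the oldest on-chip entry to external memory.
            self.external.append(self.on_chip.pop(0))
        self.on_chip.append(return_address)

    def pop(self):
        address = self.on_chip.pop()
        if self.external:
            # Refill the bottom of the on-chip stack from external memory.
            self.on_chip.insert(0, self.external.pop())
        return address
```

With this model a routine can call itself 1040 levels deep before the stack is exhausted, and the return addresses still come back in last-in, first-out order.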


2.5 Computer-Aided Design Restrictions

To  design  and  lay  out  a  chip  as  complex  as  the  FPASP  requires  the  use  of  software  tools  to 
verify the design and to draw the actual layout of the transistors. Such tools are part of the AFIT
Computer Aided Design (CAD) environment. Some of the tools which proved most useful to the
FPASP  design  process  also  placed  some  restrictions  on  the  design.  Two  of  the  tools  which  had  this 
effect  were  the  GMAT  and  Esim  tools.  The  GMAT  tool  converts  microcode  to  binary  machine 
code,  and  the  Esim  tool  simulates  a  circuit  at  the  switch  level.  The  limitations  of  these  two  tools 
directly  affected  the  architecture  of  the  FPASP  and  the  design  of  the  FPASP  circuitry. 

The  first  limitation  is  that  Esim  cannot  simulate  circuits  which  contain  certain  feedback  loops. 
Ongoing research projects at AFIT are solving this problem. The Fixrom and Nofeed programs
get  around  the  feedback  problem  by  replacing  the  unsimulatable  circuit  description  with  one  that 
contains  no  feedback  loops.  But  this  type  of  fix  is  limited  to  those  circuit  configurations  which 
these  programs  are  designed  to  recognize;  so  the  basic  problem  with  Esim  remains  and  shows  up 
when  new  circuit  designs  are  simulated.  This  problem  must  be  kept  in  mind  when  circuits  are 
designed  because  a  logically  correct  circuit  that  cannot  be  simulated  is  useless.  Simulation  of  the 
completed  FPASP  circuit  will  be  required  before  it  is  sent  off  for  fabrication. 

The second restriction is the maximum length of a microcode word that GMAT can handle,
which is 128 bits. This was sufficient for the ASP prototype chip, which has a microcode word
only 51 bits wide. The ASP was only a proof-of-concept prototype, so it had few control signals
and the GMAT restriction had no effect. 51 bits allowed the ASP controller to be very horizontal:
almost every control line had a microcode bit assigned to it [Man82]. This eliminated most of the
decoding that would otherwise have to be done outside of the microsequencer.

GMAT  does  not  support  the  number  of  control  signals  required  for  a  large  processor,  so 
the  horizontal  design  style  cannot  be  used.  This  problem  has  two  effects  on  the  FPASP  circuit: 
more  decoding  circuitry  must  be  designed  and  laid  out,  and  not  all  of  the  possible  combinations  of 
functions  for  each  macrocell  can  be  represented  in  the  microcode  fields. 

2.6  CMOS  Technology  Limitations 

Most  of  the  limitations  on  circuit  design  are  due  to  the  CMOS  technology,  which  is  the  only 
one  supported  by  the  AFIT  CAD  environment  at  this  time.  For  example,  the  resistance  of  the 
channel  formed  under  the  gate  of  a  MOS  transistor  limits  the  amount  of  current  the  device  can 
pass. The resulting gain is partly dependent on the fabrication process, but the designer decides on the
sizes of the transistors and must therefore choose the sizes that best meet the circuit requirements.
The sizing of transistors and the use of the Spice program to simulate those choices are covered in
Chapter 5.

Some  limitations  are  due  to  the  Very  Large  Scale  Integrated  circuit  (VLSI)  concept  itself. 
This  is  the  concept  of  putting  as  much  circuitry  as  possible  onto  one  chip  to  increase  the  speed  and 
decrease  the  number  of  packages  needed  to  perform  a  particular  function.  The  number  of  devices 
on  a  chip  has  been  doubling  every  two  years,  but  the  number  of  pins  available  per  chip  has  not  kept 
up  that  pace.  The  result  is  that  the  circuitry  on  the  chip  becomes  less  accessible  from  the  outside. 

This decrease in observability and controllability of VLSI circuits has increased the problem
of testing them. This problem can be eased by adding extra circuitry to the design to increase the
number of observable and controllable points in the circuit [Fuj85]. This design-for-testability (DFT)
problem must be addressed in the FPASP architecture and design.


III.  Architecture 


3.1  Introduction 

The FPASP design process began with the architectural specification and a proposed
microcode instruction set. This corresponds to the center of the diagram shown in Figure 3.1. The
architecture evolved as feedback was generated from continuous research into both the software
and hardware of the FPASP. This feedback allowed the final design to meet the needs of both
sides, above and below the architecture level. Specific examples of how this feedback affected
the architecture will be shown throughout this chapter. The result is a machine that solves the
problems discussed in Chapter 2.


Figure 3.1. Design Areas Covered in this Chapter

3.2 Data Representations


Research for this thesis began after the decision was made to go with the double precision
hardware for the Kalman filter ASP. The new ASP is called the Floating Point ASP (FPASP)
to distinguish it from the integer ASP chip designed by Capt. Gallagher, and Capt. Linderman’s
single precision ASP (SPASP) architecture.

The  first  problem  raised  by  going  to  double  precision  was  how  to  represent  64  bit  data  inside 
the  FPASP.  Since  the  existing  ASP  cells  are  bit-slice,  the  easiest  solution  would  be  to  have  the  data 
paths  64  bits  wide.  But  this  means  using  64  bit  data  even  when  not  doing  floating  point  operations. 

Most of the integer operations done in the microcode to support the floating point operations
set up loops and calculate addresses. These operations do not require 64 bit words, and the extra
bits just slow down the hardware. For example, the addresses used by the FPASP are 20 bits long,
so there is no real need for a 64 bit word from the addressing point of view. Also, the first intended
use for the FPASP is in a host processor with a 32 bit bus, so 32 bits is a good choice from an
interfacing point of view.

A 64 bit integer representation is also difficult to support in hardware. For example, the ALUs
in the FPASP use a carry-select adder. The ripple delay through this adder is the worst case delay
in the ALU; extending the adder to 64 bits would lower the clock frequency of the entire machine.
So using 64 bit integers would not be a good design choice. Also, with two 32 bit datapaths there
is the possibility of doing two integer operations in the same clock cycle.

The 32 bit integer representation was chosen for the FPASP, along with 64 bit data paths on and
off the chip. The mapping of the double precision format onto the two FPASP datapaths is shown
in Figure 3.2.
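The split of a double precision word across two 32 bit datapaths can be sketched in a few lines of Python. This is only an illustration: Figure 3.2 is not reproduced here, so the sketch assumes that the upper datapath carries the sign, the exponent, and the most significant mantissa bits, and that the lower datapath carries the remaining 32 mantissa bits. The function names `split_double` and `join_double` are hypothetical.

```python
import struct

MASK32 = 0xFFFFFFFF


def split_double(value):
    """Return (upper, lower) 32 bit halves of an IEEE 754 double.

    Assumption: the upper half holds the sign, the 11 bit exponent and
    the 20 most significant mantissa bits; the lower half holds the
    32 least significant mantissa bits.
    """
    bits = struct.unpack(">Q", struct.pack(">d", value))[0]
    return bits >> 32, bits & MASK32


def join_double(upper, lower):
    """Rebuild the double from the two 32 bit halves."""
    bits = ((upper & MASK32) << 32) | (lower & MASK32)
    return struct.unpack(">d", struct.pack(">Q", bits))[0]


upper, lower = split_double(1.5)
print(f"{upper:08X} {lower:08X}")  # 3FF80000 00000000
```

Under this assumed mapping, hardware that needs only the sign, exponent, or leading mantissa bits has to look at the upper word alone.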


3.3  General  Architectural  Features 

Given the 32 bit internal data representation and 64 bit I/O width, there can be two separate
datapaths in the FPASP architecture: upper and lower, both 32 bits wide. To support this
architecture, the external memory was split into upper and lower halves, each with its own separate
address lines from the FPASP. The basic architecture thus became two SPASPs which work
together to manipulate floating point numbers, or separately to do software support such as looping
and branching.

To get the two 32 bit datapaths, the SPASP register set and ALU were mirrored on each
side. This simplified the physical layout and the design of the chip, an important factor given the
limited amount of time available for layout and design verification. The mirror image attribute of
the FPASP can be seen in the register-level description in Figure 3.3, and in the floorplan of the
chip. A simplified floorplan is shown in Figure 3.4.

Not all of the hardware was mirrored, however. Some of the hardware is only needed on one
or the other datapath, or a piece of hardware may be so large that only one can be afforded in the
FPASP. For example, the function ROM only needs access to the most significant bits of a floating
point word, and so it appears on the upper datapath. There is also only one barrel shifter, which is
a large macrocell. The barrel shifter is useful only for non-floating point data. The bus ties allow it
to be shared by either datapath. However, using the barrel shifter from the upper datapath limits
the amount of processing which can be done on the lower datapath on that clock cycle, since the
lower and upper busses are tied together.

Figure 3.3. FPASP Register-level Description.

Figure 3.4. Basic FPASP Floorplan.

3.4 Bus Architecture

One  architectural  feature  retained  from  the  SPASP  was  the  overlapping  register  set.  This 
allows  addresses  or  data  to  be  stored  in  any  register  and  moved  to  either  the  MAR  or  the  processing 
hardware  without  going  through  an  intermediate  register.  The  overlapped  structure  is  shown  in 
the  register-level  description  of  the  FPASP  in  Figure  3.3. 

The  A  and  B  busses  supply  input  data  to  the  processing  hardware,  and  results  are  returned 
on  the  C  busses.  The  A  bus  can  be  driven  by  any  register  except  the  Memory  Address  Registers 
(MARs)  which  only  drive  out  to  pads.  The  B  busses  can  only  be  driven  by  the  general  purpose 
data  registers,  the  incrementable  data  registers,  and  the  floating  point  processors.  The  C  busses 
return  data  to  any  register,  including  the  MARs. 

The  registers  which  are  overlapped  by  the  A,  but  not  the  B  bus  have  their  own  “E”  bus.  The 
E  busses  carry  only  addresses,  and  so  are  only  20  bits  wide.  The  registers  which  can  drive  the  E 
bus  must  also  drive  the  A  bus,  so  they  are  all  32  bits  wide.  The  exceptions  to  this  are  the  MARs 
which  only  hold  20  bit  addresses,  never  data.  When  the  MARs  load  from  the  C  bus  they  take  in 
only  the  20  least  significant  bits  (LSBs). 

This  overlapped  architecture  can  sometimes  double  the  number  of  transfers  that  can  occur  in 
a  single  instruction.  For  example,  data  can  be  sent  to  the  ALU  on  the  A  and  B  busses  at  the  same 
time  the  pointer  registers  are  sending  addresses  to  the  MARs  via  the  E  bus.  The  entire  register 
set  remains  accessible  via  the  A  bus  if  the  processors  need  data  from  the  pointers  on  another  clock 
cycle. 

Another feature of the architecture allows both datapaths to share data or processing
resources: the upper and lower busses can be tied to their counterparts. In this way, registers drive to
or load from busses on the other half of the FPASP. The disadvantage of using the busses this way
is that they are then ‘tied up’ doing one operation; the other datapath cannot use the tied busses
for its own data transfers.

The ability to tie the E busses becomes most important when the FPASP is reading or writing
floating point numbers. The most convenient way to store the 64 bit floating point numbers is to
put them into the same address in each of the external memory banks. To get the same address
into both MARs does not require that a pointer on each side hold the same information. The E
busses can be tied together and driven from a single pointer register, or an address computed in
the datapath of one side can be driven onto the tied C busses and loaded into both MARs.

Thus, the FPASP bus architecture supports three different configurations depending on how
the busses are tied. Each of the datapaths can work separately on data in its own registers; or
they can share data from each other’s A and/or B busses and return results to either one on tied C
busses; or the two datapaths can transfer floating point numbers to and from the external memory
banks with only the E busses tied together.

The  first  version  of  the  architecture  contained  only  the  A  and  E  bus  ties.  Feedback  from  the 
microcode  written  by  the  students  in  the  summer  quarter  “Introduction  to  Computer  Architecture” 
(EE588)  for  a  preliminary  version  of  the  FPASP  pointed  out  the  need  for  more  interconnectivity 
to  relieve  dataflow  bottlenecks,  especially  on  the  A  bus.  At  first,  the  additional  bus  ties  were  not 
easily  supportable  in  the  hardware  due  to  the  space  the  busses  would  take  up  crossing  the  chip.  A 
revised  floorplan  which  put  both  floating  point  processing  circuits  on  the  left  side  of  the  chip  forced 
such  a  crossover  to  be  made;  so  adding  the  other  bus  ties  no  longer  required  additional  space.  The 
floorplanning  of  the  chip  is  discussed  in  Chapter  4. 

The  only  other  bus  on  the  chip  which  is  accessible  to  more  than  one  register  is  the  D  bus. 
This  bus  runs  from  the  Memory  Buffer  Registers  (MBRs)  to  the  data  pads.  The  D  bus  also  goes 
to  the  R1  and  R2  registers,  which  can  only  load  from  the  bus.  The  additional  registers  on  the  D 
bus  relieve  an  I/O  bottleneck. 

The  D  bus  is  an  internal  bidirectional  bus  which  can  be  driven  by  MBRs  for  a  data  write  to 
memory,  or  driven  by  the  data  pads  for  a  memory  read.  The  direction  of  data  flow  is  controlled 
by  the  Write  Enable  Bar  control  bits.  Each  data  path  has  its  own  control  bit  to  the  corresponding 
half  of  the  external  memory,  so  integer  transfers  can  be  independently  controlled.  A  “1”  in  this 
control  bit  field  puts  the  data  pads  in  the  input  mode,  which  is  the  default  value. 


3.5  Register  Types 

3.5.1 General Purpose Registers. Five bits were allocated for controlling access to each of the
A, B, and C busses. This allows 32 choices for access to each bus. In the case of the A bus,
of the 32 possible choices, one is used for the NOP, three for the incrementable registers, two for the
pointers, and one for the memory buffer register, leaving 25 for general purpose registers.

The reason for using up all of the leftover choices on general purpose registers is the increase
in efficiency provided by a large register set [Hen84] [Mit86]. This is because the processor does
not lose a clock cycle getting the data through the memory buffer registers, and it does not need to
calculate all of the addresses for the data.

The choice of 5 bits for the register selection microword field is based on the size of the registers
and the size of the decoders. The registers laid out for the ASP chip were used to floorplan the
FPASP chip. Feedback from this floorplan indicated that more than 32 registers in each datapath
would increase the size of the chip past the 350 square mil (0.35 sq. in.) limit set for cost and
fabrication yield reasons.

3.5.2 Incrementable Registers. Loops are one of the common software structures found
in the microcode written for matrix algebra on the SPASP, and also in most other types of programs.
The registers chosen to support this feature are incrementable data registers, each with its own
half-adder circuitry and zero flag. Each register can increment without using any other hardware resources
such as the busses or ALU. This is a critical feature, since it leaves those resources free for useful
computation, decreasing the number of clock cycles needed in each loop.

To  perform  a  loop,  the  user  has  only  to  load  an  incrementable  register  with  the  negative  value 
of  the  number  of  loop  iterations.  On  each  iteration  of  the  loop,  the  register  is  given  the  increment 
command  and  the  flag  is  checked.  On  the  last  iteration,  the  incrementer  has  counted  up  to  zero 
and  the  zero  flag  is  set.  To  exit  the  loop,  one  of  the  instructions  in  the  loop  is  a  conditional  branch 
based  on  the  value  of  the  incrementer’s  flag. 
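The loop mechanism described above can be modelled in Python as a rough sketch. The class and function names are hypothetical, the register is assumed to be 32 bits wide, and the zero flag is assumed to be updated only by the increment command; none of these details is taken from the hardware design itself.

```python
MASK32 = 0xFFFFFFFF


class IncrementableRegister:
    """Toy model of a register with its own incrementer and zero flag."""

    def __init__(self):
        self.value = 0
        self.zero_flag = False

    def load(self, value):
        # Negative counts are stored in 32 bit two's complement form.
        self.value = value & MASK32
        self.zero_flag = False

    def increment(self):
        # Assumption: the flag reflects the result of the increment.
        self.value = (self.value + 1) & MASK32
        self.zero_flag = self.value == 0


def run_loop(iterations):
    """Execute a loop body `iterations` times (iterations >= 1)."""
    counter = IncrementableRegister()
    counter.load(-iterations)
    passes = 0
    while True:
        passes += 1              # stands in for the useful loop body
        counter.increment()      # increment command, no ALU or bus used
        if counter.zero_flag:    # conditional branch on the zero flag
            break
    return passes


print(run_loop(5))  # 5
```

Because the test comes after the increment, a count of zero is not meaningful in this model; the body always runs at least once.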

The  FPASP  has  a  total  of  six  incrementable  registers,  three  on  each  datapath.  The  original 
SPASP  had  only  two  registers,  and  mirroring  that  in  the  FPASP  made  it  four.  Some  of  the 
microcode  written  for  the  SPASP  indicated  that  even  four  might  not  be  enough,  so  an  extra  one 
was  added  to  each  datapath.  Thus,  the  FPASP  can  directly  support  nested  loops  up  to  a  depth  of 
six.  After  that,  the  loop  counter  must  be  incremented,  and  the  zero  condition  checked,  by  using 
the  ALU  INC  operation. 

3.5.3 Pointer Registers. The matrix addressing problem was also solved by using registers
that had their own dedicated arithmetic circuitry. When a matrix is stored in row-major order,
it is easy enough to retrieve the data in each row by incrementing an address pointer by one; but
some operations require that the data be retrieved in a different order, such as reading a column
of the matrix.

A simple increment-by-1 is no longer enough to produce the required address in one clock
cycle. The FPASP provides a register to hold the increment amount and a dedicated adder to
perform the increment. This keeps the pointer increment operation from taking up the ALU
resources, as with the incrementable registers mentioned above. An increment holding register is
provided for each pointer, and it can only load from the C bus.

The primary purpose of the pointer registers is to compute and hold addresses; but the
overlapping bus architecture allows the pointer registers to communicate with both the data
processing hardware and the address registers. Therefore, in addition to holding and calculating 20
bit addresses, the pointers can also hold and increment 32 bit data.
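As an illustration of the pointer mechanism, the following Python sketch walks down one column of a row-major matrix by loading the increment register with the row length. The class name, the method names, and the matrix dimensions are hypothetical; only the 32 bit pointer width and the 20 bit address width come from the text.

```python
MASK32 = 0xFFFFFFFF
ADDR_MASK = 0xFFFFF  # addresses are 20 bits wide


class PointerRegister:
    """Toy model of a pointer with its own increment holding register."""

    def __init__(self, value=0, increment=1):
        self.value = value & MASK32
        self.increment = increment & MASK32

    def step(self):
        # One dedicated add per clock cycle; the ALU is not involved.
        self.value = (self.value + self.increment) & MASK32


def column_addresses(base, rows, cols, column):
    """Addresses of one column of a row-major rows x cols matrix."""
    pointer = PointerRegister(base + column, increment=cols)
    addresses = []
    for _ in range(rows):
        addresses.append(pointer.value & ADDR_MASK)
        pointer.step()
    return addresses


print([hex(a) for a in column_addresses(0x100, rows=3, cols=4, column=1)])
# ['0x101', '0x105', '0x109']
```

With an increment of one the same pointer steps along a row, so a single mechanism covers both access patterns.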

3.5.3.1 Additional Increment Hardware. Originally, the architecture called for a
full adder for each pointer register. This is a case where feedback from both the hardware and
software research worked to make the architecture more efficient. To perform a 32 bit addition in
one clock cycle required a carry-select adder. SPICE simulations of the carry-select adder in the
ALU showed that this operation could be done in 12 nanoseconds. This was fast enough to meet
the one clock cycle requirement, but floorplanning indicated that such an adder would be too large
to afford having one for each pointer register.

The  architecture  was  altered  so  that  the  two  pointers  on  each  datapath  would  share  a  single 
adder,  while  each  kept  its  own  increment  register.  For  the  microcode  written  for  the  EE588  projects, 
not  having  both  pointers  incrementable  on  the  same  clock  cycle  did  not  increase  the  number  of 
lines  of  code.  The  need  to  increment  both  pointers  on  the  same  clock  cycle  did  not  arise  often,  and 
when  it  did  the  second  increment  could  be  moved  to  another  existing  line  of  code. 

In a continuation of this architectural style, the incrementable registers were also set up to
share a single adder. Not only was this supposed to decrease the layout area, but it would also save
a bit in the microcode word for each datapath, which had reached the 128 bit limit that could be
supported by the GMAT CAD tool. Again, the microcode showed that multiple increments could
be avoided without adding a line of code.

Spice simulation results from the ALU adder also indicated that the simpler circuitry of the
incrementing adder would be fast enough even if the carry had to ripple down the entire 32 bits.
A simple ripple adder incrementer is small enough to allow one for each incrementable register.
Having the registers share a single incrementer would have added to the complexity of layout
without saving very much space.

So each incrementer has its own adder, despite the software feedback, but the condensed
control word was retained to meet the requirements of the GMAT tool. Since all of the register
hardware exists on both the upper and lower datapaths, two increments can still be done in one
clock cycle for both the incrementable registers and the pointer registers.

3.5.4 Memory Buffer Registers. The Memory Buffer Registers (MBRs) drive data to
the pads via the D bus. They also were initially the only registers to read data in from the pads,
but that architectural feature changed twice due to feedback from the software. The first change
was made on the SPASP to allow the general purpose register R1 to also load data from the D
bus to ease a dataflow bottleneck in the dot product routine. Data could then be read from the
external memory into R1, and in the same instruction, results could be loaded into the MBR for
writing out in the next clock cycle.

The  role  of  the  MBR  was  further  spread  out  to  include  the  R2  registers  when  a  similar 
bottleneck  arose  in  code  written  by  Capt.  Linderman  for  supporting  complex  arithmetic.  Again 
there  were  conflicts  using  the  MBR  for  incoming  data  as  well  as  outgoing  data.  Now  three  registers 
have  a  direct  interface  with  the  memory.  The  MBR  architecture  changes  caused  by  this  software 
feedback  were  checked  against  the  hardware  layout  to  assure  that  they  were  easily  implementable. 

3.5.5 Memory Address Registers. Access to the external memory is controlled by one
address register on each datapath. The Memory Address Registers (MARs) hold the addresses to
be driven off of the chip. These two registers (upper and lower) can load addresses from three
sources. The normal source is from the pointer registers via the E busses. The MAR can also load
addresses from the 20 LSBs of the C bus, allowing any register to be used to hold an address. It
can also load the incremented value of its present contents.
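The three MAR load sources can be summarized in a small Python model. This is a sketch only: the class and method names are hypothetical, and the wrap-around of the self-increment at 20 bits is an assumption that the text does not state.

```python
ADDR_MASK = (1 << 20) - 1  # MARs hold 20 bit addresses


class MemoryAddressRegister:
    """Toy model of one MAR with its three load sources."""

    def __init__(self):
        self.address = 0

    def load_from_e_bus(self, e_bus_value):
        # Normal source: a pointer register driving the 20 bit E bus.
        self.address = e_bus_value & ADDR_MASK

    def load_from_c_bus(self, c_bus_value):
        # Only the 20 least significant bits of the 32 bit C bus are kept.
        self.address = c_bus_value & ADDR_MASK

    def load_incremented(self):
        # Assumption: the increment wraps around at 20 bits.
        self.address = (self.address + 1) & ADDR_MASK


mar = MemoryAddressRegister()
mar.load_from_c_bus(0xABCDE123)
print(hex(mar.address))  # 0xde123
mar.load_incremented()
print(hex(mar.address))  # 0xde124
```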

3.6 Processing Components

The FPASP is intended to support a variety of application specific algorithms without a
change of hardware and without external co-processors. The choice of processing components
included in the FPASP architecture reflects this approach to ASP design. The processing component
set must be generic enough to meet the needs of a wide range of computations. The FPASP
processing components support three types of computation: logic operations, 32 bit integer arithmetic,
and 64 bit IEEE double precision floating point arithmetic. The FPASP can also perform parallel operations
on shorter integers. For example, the 32 bit datapath can be organized as four 8 bit integers. This
way four operations can be performed at once, as long as a carry out of one 8 bit integer operation
does not affect the next most significant 8 bit integer. To avoid this overflow problem 16 bits can
be used for each integer. Using this approach, the FPASP could perform four 8 bit by 8 bit multiplies
by using the 32 bit integer multiply option of the floating point multiplier.
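The carry problem with packed short integers can be demonstrated with a short Python sketch. The helper names are hypothetical, and the 32 bit add stands in for a single ALU ADD; the sketch illustrates only the packing idea, not the FPASP microcode.

```python
MASK32 = 0xFFFFFFFF


def pack4(values):
    """Pack four 8 bit integers into one 32 bit word, first value in the MSBs."""
    word = 0
    for v in values:
        word = (word << 8) | (v & 0xFF)
    return word


def unpack4(word):
    """Split a 32 bit word into four 8 bit fields, MSB field first."""
    return [(word >> shift) & 0xFF for shift in (24, 16, 8, 0)]


def add32(a, b):
    """One 32 bit addition, as a single ALU ADD would perform it."""
    return (a + b) & MASK32


# No field overflows, so the four sums are independent.
print(unpack4(add32(pack4([1, 2, 3, 4]), pack4([10, 20, 30, 40]))))
# [11, 22, 33, 44]

# 200 + 100 does not fit in 8 bits: the carry corrupts the neighbouring field.
print(unpack4(add32(pack4([0, 0, 0, 200]), pack4([0, 0, 0, 100]))))
# [0, 0, 1, 44]

# With 16 bits per integer the same sum stays inside its own field.
word = add32((7 << 16) | 200, (1 << 16) | 100)
print(word >> 16, word & 0xFFFF)  # 8 300
```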

3.6.1 Integer Arithmetic and Logic Units. The FPASP has two independent Arithmetic
and Logic Units (ALUs), one on each datapath. This follows the basic mirror image
architectural style applied to the register sets. These two units perform the logic and most of the 32 bit
integer arithmetic operations. The one integer operation not performed by the ALUs is the 32 bit
multiply, which is done by the floating point multiplier. The ALU and the linear shifter are shown
in Figure 3.5.

The  choice  of  ALU  functions  was  initially  based  on  the  ASP  library  ALU  cells.  These,  in 
turn,  were  based  on  the  ALU  functions  of  the  basic  processor  in  the  Mano  text  [Man82].  The  ASP 
ALU  was  extended  for  the  FPASP  to  include  the  inverse  logic  functions  NAND  and  NOR.  This 
extension  not  only  increased  the  variety  of  operations  available,  but  also  decreased  the  amount  of 
decoding  hardware  needed  to  control  the  ALU.  This  is  an  instance  where  the  hardware  research 
made  the  architecture  more  efficient  and  more  flexible. 

The ALU functions supported by the FPASP are listed in Table 3.1. The FPASP can
support operations on integers longer than 32 bits with the functions “Add with Carry” (adc) and
“Subtract With Borrow” (swb). These functions allow the present operation to factor in the carry
or borrow generated by the previous operation. A borrow is the inverse of the carry out saved from
the previous operation, so the same flip-flop can be used for both the saved carry and the saved borrow.
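As a sketch of how the saved carry extends the word length, the following Python fragment adds two 64 bit integers as an ADD on the low words followed by an ADC on the high words. The function names are hypothetical and the carry flip-flop is modelled as a plain return value.

```python
MASK32 = 0xFFFFFFFF


def alu_add(a, b, carry_in=0):
    """32 bit add returning (result, carry_out)."""
    total = (a & MASK32) + (b & MASK32) + carry_in
    return total & MASK32, total >> 32


def add64(a_hi, a_lo, b_hi, b_lo):
    """Add two 64 bit integers held as (high, low) 32 bit words."""
    lo, carry = alu_add(a_lo, b_lo)          # ADD: carry out is saved
    hi, carry = alu_add(a_hi, b_hi, carry)   # ADC: saved carry is added in
    return hi, lo, carry


hi, lo, carry = add64(0x00000001, 0xFFFFFFFF, 0x00000000, 0x00000001)
print(f"{hi:08X} {lo:08X} carry={carry}")  # 00000002 00000000 carry=0
```

The same chaining applies to longer integers: each additional 32 bit word costs one more ADC.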


Function   Value Passed to Shifter        Flags Affected
                                          (CARRY, OVERFLOW, SIGN, ZERO)
--------   ----------------------------   -----------------------------
MOVN       A                              none
OR         A OR B                         zero
AND        A AND B                        zero
XOR        A XOR B                        zero
MOV        A                              zero
NAND       A NAND B                       zero
NOR        A NOR B                        zero
NOT        NOT A                          zero
INC        A + 0 + 1                      All Four
SET        A + B + 1                      All Four, Sets carry
ADC        A + B + previous carry         All Four
ADD        A + B + 0                      All Four
NEGA       NOT A + 0 + 1                  All Four
SUB        A + NOT B + 1                  All Four
SWB        A + NOT B + previous borrow    All Four
DEC        A + (-1) + 0                   All Four

Table 3.1. ALU Operations.


The two ALUs generate separate flags which are stored in flip-flops, and can be used as
branch conditions by the microsequencer. These flags are the ones available in most processors:
the carry/borrow out, an overflow condition, a zero result condition, and the two’s complement
sign of the result. The flip-flop adds a one clock cycle delay from when the flag is generated to
when it can be used as a branch condition.

3.6.2  Bidirectional  Linear  Shifters.  The  result  of  the  ALU  operation  feeds  through  a 
linear  shifter  before  reaching  the  C  bus.  The  linear  shifter  can  perform  a  one  bit  shift  in  either 
direction.  The  shifts  which  can  be  performed  are  listed  in  Table  3.2.  A  block  diagram  of  the  linear 
shifter  is  shown  in  Figure  3.5. 

The type of shift performed depends on the source of the bit shifted in, also shown in the
table. The shifts include arithmetic shifts, logical shifts, circular shifts, and shifts using stored bits
from the previous ALU operation. The bit shifted out is saved in a flip-flop, and it can be chosen
as the shifted-in bit on the next clock cycle. The flip-flop for the shift out bit is loaded whenever
a left or right shift is performed.


Function   Type of Shift                         Bit Shifted In
--------   -----------------------------------   -------------------------------
NOP        Shifter does not drive C bus          none
GNDC       Shifter grounds C bus                 none
PASS       No shift, ALU output goes on C bus    none
SLOT       Chained Left shift                    previous shift-out bit into LSB
SLMS       Circular Left shift                   MSB circulated into LSB
SLCY       Shift Left with Carry                 Carry of present ALU operation
SL0        Shift Left with Zero                  0 into LSB
SL1        Shift Left with One                   1 into LSB
SRLS       Circular Right shift                  LSB circulated into MSB
SRCF       Shift Right with previous Carry       Carry flag into MSB
SRS        Shift Right with previous Sign        Sign flag into MSB
SROT       Chained Right Shift                   previous shift-out bit into MSB
SRSE       Arithmetic Right shift                MSB extended
SRCY       Shift Right with Carry                Carry of present ALU operation
SR0        Shift Right with Zero                 0 into MSB
SR1        Shift Right with One                  1 into MSB

Table 3.2. Shifter Functions.
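The behavior of the one bit shifts can be sketched as a small Python function. The mnemonics follow Table 3.2 (with the zero and one fills written SL0, SL1, SR0, SR1), but the function itself is a hypothetical model: it covers only a subset of the functions, and the carry and the previous shift-out bit are passed in as plain arguments rather than read from flip-flops.

```python
MASK32 = 0xFFFFFFFF


def linear_shift(func, word, carry=0, prev_out=0):
    """Return (c_bus_value, shift_out_bit) for a subset of Table 3.2.

    `carry` is the carry of the present ALU operation and `prev_out` is
    the bit saved by the previous shift. For PASS and GNDC the saved
    shift-out bit is returned unchanged.
    """
    word &= MASK32
    msb, lsb = word >> 31, word & 1
    if func == "PASS":
        return word, prev_out
    if func == "GNDC":
        return 0, prev_out
    left_fill = {"SL0": 0, "SL1": 1, "SLMS": msb, "SLCY": carry, "SLOT": prev_out}
    right_fill = {"SR0": 0, "SR1": 1, "SRLS": lsb, "SRSE": msb,
                  "SRCY": carry, "SROT": prev_out}
    if func in left_fill:
        return ((word << 1) & MASK32) | left_fill[func], msb
    if func in right_fill:
        return (word >> 1) | (right_fill[func] << 31), lsb
    raise ValueError(f"function not modelled: {func}")


# A 64 bit left shift built from two chained 32 bit shifts.
lo, out = linear_shift("SL0", 0x80000000)
hi, _ = linear_shift("SLOT", 0x00000001, prev_out=out)
print(f"{hi:08X} {lo:08X}")  # 00000003 00000000
```

The chained functions are what make multi-word shifts possible: the bit that leaves one word enters the next word on the following clock cycle.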

The inclusion of a linear shifter reflects the goal of the FPASP’s architectural specification
process: support a wide range of operations. An example of how this process builds on itself
was seen in the feedback from the EE588 projects. One of the groups used the linear shifter in
their microcode and requested that the bit shifted out be made available as a branch condition.
Hardware designed up to that point had not used up all of the available flag inputs, so the bits
from the upper and lower shift-out flip-flops were added.

Another feature of the shifter allows simple initialization of a register to zero. The command
GNDC causes the shifter to ground the C bus to all zeroes, which can then be loaded by any register.
This leaves the A and B busses free to transfer data to the floating point hardware, or to the ALU
for an operation that is only intended to set the flags.

3.6.3 Barrel Shifter. For shifts longer than one bit, a barrel shifter is included in the
FPASP architecture. This device can perform a left circular shift of 1 to 31 bits in a single clock
cycle. The barrel shifter takes its input from the A bus and puts the result on the C bus. The
barrel shifter cannot pass the input straight through; the ALU/Shifter must be used for that.

Since the barrel shifter is only useful for integers, only one is provided. It has been placed on
the lower datapath. This placement does not restrict its use to the lower datapath alone, however:
the bus ties described above allow the shifter to take in bits from the upper A bus and return
the result to the upper C bus. This interconnectivity frees up space on the upper datapath for a
different piece of hardware, increasing the flexibility of the FPASP architecture.

As  with  the  linear  shifter,  the  barrel  shifter’s  flexibility  was  increased  by  feedback  from  the 
EE588  software  projects.  Use  was  made  of  all  the  FPASP  features  defined  for  those  projects,  and 
almost  always  the  result  was  a  need  for  an  extension  of  those  features.  In  the  case  of  the  barrel 
shifter  it  was  the  need  to  choose  the  length  of  the  shift,  based  on  a  result  in  the  datapath.  The 
architectural  solution  was  to  allow  a  choice  between  the  microword  control  bits  or  bits  stored  off  of 
the  lower  C  bus  from  any  previous  operation. 

The  barrel  shifter  can  be  made  to  perform  a  linear  shift  rather  than  a  circular  shift  by  setting 
the  bits  that  will  be  shifted  out  of  the  left  side  to  zero.  This  can  be  done  by  inserting  a  zero  on 
top  of  those  bits  with  the  literal  inserter,  which  is  described  in  the  next  section.  This  is  possible 
because  the  literal  inserter  drives  the  A  bus,  so  it  can  affect  the  data  being  fed  into  the  barrel 
shifter.  Driving  two  values  onto  the  A  bus  results  in  an  AND  operation,  so  by  driving  the  register 
contents  as  well  as  the  literal,  the  bits  on  the  bus  can  be  masked  to  zero. 
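
The masking trick can be sketched in a few lines of Python. This is only an illustration of the idea, assuming a 32 bit datapath and a left rotate; the function names are hypothetical, and the mask is modeled as a full 32 bit value rather than as a 16 bit literal driven onto half of the bus.

```python
WIDTH = 32                      # datapath width assumed from the text
MASK = (1 << WIDTH) - 1

def rotate_left(word, n):
    """Circular left shift, the barrel shifter's native operation."""
    n %= WIDTH
    return ((word << n) | (word >> (WIDTH - n))) & MASK

def wired_and(register_value, literal_value):
    """Two drivers on the A bus: the bus carries the AND of both values."""
    return register_value & literal_value & MASK

def linear_shift_left(word, n):
    """Zero the n MSBs first, so the rotate wraps only zeros around."""
    keep_low = MASK >> n        # ones everywhere except the n MSBs
    return rotate_left(wired_and(word, keep_low), n)
```

With the input 0x80000001 and a shift of 4, the plain rotate wraps the top bit around to the right side, while the masked version discards it.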

3.6.4  Literal  Inserter.  A  literal  inserter  is  supplied  in  the  FPASP  to  allow  bits  of  the 
microcode  word  to  be  used  as  data.  The  literal  insertion  feature  allows  constants  to  be  stored 
directly in the microcode rather than in external memory. This makes the FPASP more efficient in several
ways:  no  clock  cycles  are  wasted  reading  in  the  data  or  calculating  its  address,  a  memory  location 
is  not  wasted  on  a  constant,  and  the  registers  are  not  affected  so  data  stored  there  need  not  be 
saved  out  to  memory. 

The size of the literal field in the microcode word is 16 bits, half the width of the datapath.
This width is a compromise between the size of the microcode field and the width of the datapath.
To supply a full 32 bits from the microcode would waste too many of the available bits. The
maximum width of the microcode word is limited by the capabilities of the GMAT program and
also by the size of the microcode ROM. Extra bits in the ROM add to its area and access time.
These factors are covered in more detail in other sections.

The literal inserter places the data onto the 16 LSBs or MSBs of either A bus, where it is
immediately available as data to the processing hardware. If the literal is placed on the lower bus
LSBs it is at the input of the barrel shifter, which can then be used to rotate the bits to the MSBs,
leaving the LSBs free to get another literal on the next clock cycle. This way a full 32 bit data
word could be formed.
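
The two-cycle construction of a 32 bit constant can be sketched as follows. The sketch assumes the second literal is simply merged into the cleared LSBs; the text does not say which hardware performs that merge, so the OR below is an assumption, as are the function names.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1

def rotate_left(word, n):
    """Circular left shift standing in for the barrel shifter."""
    n %= WIDTH
    return ((word << n) | (word >> (WIDTH - n))) & MASK

def build_constant(high16, low16):
    # Cycle 1: the first literal arrives on the 16 LSBs of the lower A bus
    # and the barrel shifter rotates it into the 16 MSBs.
    word = rotate_left(high16 & 0xFFFF, 16)
    # Cycle 2: the second literal fills the LSBs left clear by the rotate
    # (merge step assumed, not described in the text).
    return word | (low16 & 0xFFFF)
```

For example, the literals 0xDEAD and 0xBEEF combine into the single word 0xDEADBEEF.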

Here again, feedback from the Computer Architecture class pointed out the need to have the
literal inserter drive the upper A bus also, despite the fact that the barrel shifter is not available
there. The reason was that tying the A busses meant being unable to use the lower datapath,
which would have added a clock cycle to the routine. In this case, the resulting hardware design
was more complicated than the original literal inserter, but the increase in flexibility outweighed
the increase in complexity.

3.6.5 Function ROM. The function ROM stores initial guesses (seeds) for iterative subroutines.
These are routines which can calculate the square root or other function of a number in
only a few clock cycles. An example would be a routine that uses the Newton-Raphson algorithm
to calculate the square root or inverse of a number. The number of clock cycles needed to converge
on the final result depends on getting a good initial guess, called a seed. The function ROM
can store tables of precalculated seeds for eight different routines. The function ROM tables are
partly predefined, for algorithms in permanent microcode memory. The rest of the ROM is laser
programmable, so users can define their own seeds.

The  function  ROM  is  intended  for  use  with  floating  point  numbers.  The  table  to  be  used  for 
a  particular  function  is  selected  by  control  bits.  The  MSBs  of  the  floating  point  number’s  mantissa, 
and the LSB of the exponent are used to address a specific seed value. This seed provides the four
mantissa MSBs and exponent LSB of the initial guess. The input bits are read off the upper B bus
and  the  seed  is  placed  on  the  upper  C  bus.  Figure  3.2  shows  which  bits  these  are. 
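
A seeded Newton-Raphson square root can be sketched in Python to show why a small table is enough. This is a software illustration under stated assumptions, not the FPASP microcode: the table holds full floating point values rather than only four mantissa bits, the exponent halving is done with `math.ldexp`, the input is assumed to be a positive normalized number, and all names are hypothetical.

```python
import math
import struct

MANTISSA_INDEX_BITS = 4   # mantissa MSBs used in the table address
EXPONENT_BIAS = 1023      # IEEE double precision exponent bias

def split_double(x):
    """Return the biased exponent and 52 bit mantissa field of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

def seed_address(x):
    """Address formed from the exponent LSB and the mantissa MSBs."""
    biased_exponent, mantissa = split_double(x)
    top_bits = mantissa >> (52 - MANTISSA_INDEX_BITS)
    return ((biased_exponent & 1) << MANTISSA_INDEX_BITS) | top_bits

def build_sqrt_table():
    """Precalculate one seed per address, using the middle of each bin."""
    table = []
    for address in range(2 << MANTISSA_INDEX_BITS):
        exponent_lsb = address >> MANTISSA_INDEX_BITS
        top_bits = address & ((1 << MANTISSA_INDEX_BITS) - 1)
        significand = 1.0 + (top_bits + 0.5) / (1 << MANTISSA_INDEX_BITS)
        if exponent_lsb == 0:      # even biased exponent: odd true exponent
            significand *= 2.0
        table.append(math.sqrt(significand))
    return table

SQRT_SEEDS = build_sqrt_table()

def sqrt_newton(a, iterations=4):
    """Newton-Raphson square root started from the table seed."""
    biased_exponent, _ = split_double(a)
    half_exponent = (biased_exponent - EXPONENT_BIAS) // 2
    guess = math.ldexp(SQRT_SEEDS[seed_address(a)], half_exponent)
    for _ in range(iterations):
        guess = 0.5 * (guess + a / guess)
    return guess
```

Because the seed is already within a few percent of the answer and each iteration roughly squares the relative error, a handful of iterations is enough in this model.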

3.6.6 Floating Point Multiplier. The multiplier is the largest single cell in the floorplan.
It has been designed and is being laid out by CPT Fretheim as a special project [Fre88].


The multiplier is a single combinational logic circuit which embodies the octal Booth’s encoding
scheme. The long combinational path requires that the multiplier be given two clock cycles
to settle. This creates a pipelined microcode structure. If the multiplier is used continuously, it can
put out results every two clock cycles. An extra cycle is needed to initially load the input registers.
After that, the next input can be loaded in the same cycle that the result is driven out.
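
For readers unfamiliar with octal (radix-8) Booth encoding, the recoding step can be sketched in Python. This is a generic textbook form of the technique, not a description of the circuit in [Fre88]; it treats the multiplier as a two's complement integer, and the function names are hypothetical.

```python
def booth_radix8_digits(y, nbits):
    """Recode an nbits-wide two's complement multiplier into digits in -4..4.

    Each digit examines three new bits plus the top bit of the previous
    group, so about nbits/3 partial products are needed instead of nbits.
    """
    y &= (1 << nbits) - 1
    if y >> (nbits - 1):            # reinterpret as a signed value
        y -= 1 << nbits
    groups = (nbits + 2) // 3

    def bit(k):
        # Python's >> sign-extends negative integers, which supplies the
        # sign extension needed when nbits is not a multiple of three.
        return 0 if k < 0 else (y >> k) & 1

    return [-4 * bit(3 * i + 2) + 2 * bit(3 * i + 1) + bit(3 * i) + bit(3 * i - 1)
            for i in range(groups)]

def booth_multiply(x, y, nbits):
    """Sum the partial products selected by the recoded digits."""
    digits = booth_radix8_digits(y, nbits)
    return sum(d * x * 8 ** i for i, d in enumerate(digits))
```

Each recoded digit selects a small multiple of the multiplicand, which is what lets the hardware cut the number of partial products to roughly a third.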

The  selection  of  possible  destinations  for  the  result  was  determined  by  the  microcode  written 
for  the  dot  product  routine.  Originally  the  multiplier  was  to  drive  only  the  C  busses  like  the  other 
data  processing  circuits.  In  the  dot  product  routine,  the  adder  accumulates  the  final  result.  It 
was more efficient to load the adder’s previous result back into the adder’s input register directly,
rather  than  have  the  data  be  delayed  by  a  register;  so  the  adder  was  given  the  ability  to  drive  the 
B  busses  also.  The  same  choice  of  output  busses  was  also  given  to  the  multiplier. 

This  permits  100%  utilization  of  these  two  expensive  resources.  The  clock  cycle  where  the 
multiplier  is  settling  can  be  used  to  load  up  the  floating  point  adder,  and  vice  versa,  so  the  two  are 
kept  in  continuous  operation.  This  is  done  in  the  dot  product  routine,  whose  efficiency  is  a  critical 
factor in the overall efficiency of the Kalman filtering routine. The dot product routine is listed in
Appendix B1.

The  multiplier  supports  the  IEEE  double  precision  floating  point  format.  This  requires  that 
the  multiplier  flag  the  conditions  of  underflow,  zero  result,  denormalized  result,  overflow,  and  a 
result  that  is  not  a  number.  These  flags  are  available  to  the  microsequencer  so  that  branches  to 
exception  handling  routines  can  be  made.  The  flags  which  signal  that  an  invalid  result  in  any  form 
has  been  produced  are  also  OR’d  together  into  a  single  flag  called  TRPS  which  can  be  used  for  a 
single  condition  check.  This  flag  represents  an  overflow  condition  in  the  adder,  multiplier,  upper 
ALU  or  lower  ALU,  or  the  not-a-number  condition  in  the  adder  or  multiplier. 

The  multiplier  is  also  designed  to  perform  integer  multiplies  on  32  bit  wide  words,  producing 
up  to  a  64  bit  result.  The  LSBs  of  the  result  will  be  driven  to  the  lower  B  or  C  busses.  If  the 
result is longer than 32 bits the extra MSBs are driven to the upper datapath and a flag is raised.
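
The split of the integer product between the two datapaths can be sketched as below. The sketch assumes unsigned operands, which the text does not specify, and the function name is hypothetical.

```python
M32 = (1 << 32) - 1

def integer_multiply(a, b):
    """Return (lower word, upper word, flag) for a 32 x 32 bit multiply."""
    product = (a & M32) * (b & M32)   # unsigned operands assumed
    low = product & M32               # LSBs for the lower datapath
    high = product >> 32              # extra MSBs for the upper datapath
    flag = high != 0                  # raised when the result exceeds 32 bits
    return low, high, flag
```

A product that fits in 32 bits leaves the upper word at zero and the flag clear.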

3.6.7 Floating Point Adder/Subtractor. The double precision floating point adder is
being  designed  as  an  ongoing  class  project.  It  will  also  perform  a  floating  point  subtraction.  The 
interface  to  the  adder  is  identical  to  that  of  the  multiplier.  The  adder  will  use  a  carry-select  adder 
to  perform  the  mantissa  additions  or  subtractions. 

The  adder  is  also  being  given  two  clock  cycles  to  settle,  although  it  may  be  possible  for  it  to 
settle  in  one  clock  cycle.  That  will  not  be  determined  until  more  Spice  simulations  are  run. 

3.7 Microsequencer

The  FPASP  is  controlled  by  a  microcode  sequencer.  The  microcode  is  stored  in  two  forms  of 
ROM:  a  fixed  store  of  microcode,  and  a  write-once  microcode  store  used  to  tailor  the  FPASP  to 
the  specific  application.  This  architecture  was  extended  to  allow  the  microaddress  stack  to  extend 
into  the  external  memory,  and  to  allow  for  external  control  of  the  FPASP.  These  two  features  were 
major  areas  of  hardware  research,  and  are  discussed  in  greater  detail  in  Chapter  4. 

A  diagram  of  the  microsequencer  is  shown  in  Figure  3.6.  The  Control  Address  Register  (CAR) 
supplies  a  10  bit  address  to  the  microcode  ROMs.  The  CAR  can  be  reset  to  zero  by  the  GO  signal, 
so address 0000000000 is the start of the microcode routine. Ten bits allow up to 1024 words in the
microcode  ROM,  but  only  784  are  used. 

The CAR is loaded every clock cycle from a 4-to-1 multiplexer (mux). The choices of input
include  the  incremented  value  of  its  present  contents,  an  address  from  the  microcode  word  presently 
on  the  control  bus,  the  address  on  the  top  of  the  stack,  or  an  address  from  the  mapping  ROM. 

The  incremented  address  is  also  the  default  selection,  and  is  the  selection  used  if  the  condition 
for a branch is false. The address from the control bus allows the program to branch to another
section  of  the  code,  or  to  branch  to  a  subroutine.  If  a  subroutine  branch  is  chosen,  the  incremented 
address  value  is  pushed  onto  the  stack.  The  choice  of  address  from  the  stack  represents  a  return 
from  a  subroutine  call.  The  Mapping  ROM  holds  addresses  that  are  selected  by  bits  in  the  lower 
R1  register.  These  addresses  point  to  the  support  microcode  for  the  assembly  language. 
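
The address selection described above can be sketched as a single function. This is a behavioral model only: it ignores the pipeline latency, it assumes every non-sequential choice is gated by the branch condition, and the operation names are hypothetical labels rather than actual microcode field values.

```python
ADDRESS_MASK = 0x3FF   # the CAR holds a 10 bit address

def next_address(car, op, condition, branch_address, stack, map_address):
    """Choose the next CAR value from the four mux inputs."""
    incremented = (car + 1) & ADDRESS_MASK
    if op == "next" or not condition:
        return incremented                 # default, or branch not taken
    if op == "branch":
        return branch_address & ADDRESS_MASK
    if op == "call":
        stack.append(incremented)          # save the return address
        return branch_address & ADDRESS_MASK
    if op == "return":
        return stack.pop()                 # address on top of the stack
    if op == "map":
        return map_address & ADDRESS_MASK  # address from the mapping ROM
    raise ValueError("unknown operation: " + op)
```

A call followed by a return brings the sequencer back to the line after the call, while a branch whose condition is false falls through to the incremented address.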

Conditions for branching are selected from a 48-to-1 mux. Flags to one section of the mux
can  be  inverted,  giving  a  total  of  64  branch  conditions.  Two  of  these  selections  are  “unconditionally 
true”  and  “unconditionally  false”.  The  true  choice  is  for  branching  without  reference  to  a  flag;  and 
the  false  choice  is  the  default,  for  executing  sequential  lines  of  microcode.  The  branch  conditions 
and their sources are listed in the microcode field definitions in Appendix A1.

The  choice  of  flags  changed  throughout  the  architectural  specification  process  with  feedback 
from the software and hardware research. One such change was mentioned above for the shifted-out
bit.  Another  was  a  suggestion  from  the  Kalman  filter  research  group  to  have  8  externally  settable 
flags.  These  were  originally  to  be  stored  in  a  separate  register,  but  this  was  changed  to  having 
them  come  directly  from  the  upper  R1  register’s  four  MSBs  and  four  LSBs.  The  idea  of  having 
the  R1  flags  input  from  outside  the  FPASP  was  intended  only  to  allow  the  user  to  choose  between 
separate  routines  in  the  ROM,  a  simplistic  form  of  external  instruction  decoding.  This  feature  was 
later extended to include input of control bits as well as flag bits.

Figure 3.6. Microsequencer Block Diagram.

The branch condition and type of branch are selected by bits from the control bus and control
bits generated in the microsequencer. The “branch logic” cell shown in Figure 3.6 is a collection of
gates which use these control bits to select which of the four available addresses will pass through
the 4-to-1 mux into the CAR. This cell also generates the control signals needed by the stack and
MAP.

The  selected  address  is  loaded  into  the  CAR  and  driven  to  the  microcode  ROM  on  the  next 
clock  cycle.  The  microcode  control  word  out  of  the  ROM  is  further  delayed  one  clock  cycle  by  a 
set  of  master-slave  flip-flops  called  the  Pipeline.  This  pipelined  control  architecture  provides  an 
orderly  presentation  of  the  control  bits  to  the  datapath,  but  it  also  introduces  a  latency  between 
the  time  the  address  is  chosen  and  the  time  its  microcode  control  word  is  executed. 

The  results  of  this  latency  are  that  the  flags  cannot  be  used  as  branch  conditions  in  the  same 
line  of  code  they  are  generated  on,  and  the  instruction  immediately  after  the  branch  instruction  is 
performed  even  if  the  branch  is  taken.  These  restrictions  usually  have  no  effect  on  the  efficiency  of 
the  microcode,  but  sometimes  a  no-operation  (nop)  line  of  code  must  be  inserted  after  a  branch  to 
prevent unwanted computation. Examples of the effects of the latency are given in Chapter 4.
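
The second restriction, that the line after a branch always executes, can be illustrated with a toy sequencer. The program format and operation names below are hypothetical and the model is far simpler than the real control path; it only shows the order in which lines are executed.

```python
def run(program, max_steps=100):
    """Return the addresses executed by a sequencer with a one-line branch delay."""
    trace = []
    pc = 0
    pending = None                # branch target waiting for the delay line
    while pc < len(program) and len(trace) < max_steps:
        op, arg = program[pc]
        trace.append(pc)
        next_pc = pc + 1
        if pending is not None:   # this line was the one after a branch
            next_pc = pending
            pending = None
        if op == "branch":
            pending = arg         # takes effect after the following line
        elif op == "halt":
            break
        pc = next_pc
    return trace

program = [
    ("op", None),       # 0
    ("branch", 4),      # 1: branch to line 4
    ("op", None),       # 2: still executed; a nop would go here if unwanted
    ("op", None),       # 3: skipped
    ("op", None),       # 4: branch target
    ("halt", None),     # 5
]
```

Running this program executes lines 0, 1, 2, 4 and 5: line 2 runs even though the branch on line 1 is taken.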

3.7.1 Microaddress Stack Architecture for Recursion. One of the original problems
to  be  solved  by  the  FPASP  architecture  was  how  to  support  recursive  microcode  routines.  Since 
these  routines  continually  call  themselves,  and  can  call  other  routines  as  well,  the  depth  of  the  stack 
becomes  the  determining  factor  in  the  amount  of  recursion  the  FPASP  can  support.  The  stack 
registers  take  up  area  on  the  chip,  so  the  depth  of  the  stack  cannot  be  increased  without  an  area 
and  cost  penalty. 

The  solution  chosen  for  the  FPASP  is  to  write  the  stack  out  to  external  memory  when  it 
gets  full.  After  the  stack  is  filled,  each  successive  call  pushes  the  lowest  address  on  the  stack  out 
to  external  memory.  A  return  then  pops  that  address  back  into  the  bottom  of  the  stack.  The 
architecture  of  this  stack  is  shown  in  Figure  3.7.  For  most  applications  the  on-chip  stack  will  be 
deep  enough  so  that  external  memory  will  not  be  needed. 

This  architecture  provides  a  pair  of  counters  to  keep  track  of  the  address  and  status  of  the 
external  stack.  The  stack  entry  and  address  for  the  external  memory  are  sent  directly  to  the  lower 
data  and  address  pads,  so  the  datapaths  are  not  affected. 
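
The spill behavior can be modeled with a short Python class. The 16 word on-chip depth is taken from the abstract; the class and attribute names are hypothetical, the pair of counters is replaced by ordinary Python containers, and the limit on the external region is not modeled.

```python
from collections import deque

class SpillingStack:
    """On-chip return stack that spills its lowest entry to external memory."""

    def __init__(self, on_chip_depth=16):
        self.depth = on_chip_depth
        self.on_chip = deque()    # right end is the top of the stack
        self.external = []        # stands in for the external memory region

    def push(self, address):
        if len(self.on_chip) == self.depth:
            # Stack full: the lowest address goes out to external memory.
            self.external.append(self.on_chip.popleft())
        self.on_chip.append(address)

    def pop(self):
        address = self.on_chip.pop()
        if self.external:
            # Refill the bottom of the stack from external memory.
            self.on_chip.appendleft(self.external.pop())
        return address
```

Return addresses still come back in last-in, first-out order regardless of how many calls have spilled. Popping an empty stack raises an exception in this model, which loosely corresponds to the trap on an inconsistent stack operation.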

The  only  restriction  this  architecture  places  on  the  microcode  is  that  there  cannot  be  a  lower 
memory  access  on  the  same  cycle  as  the  call  or  return.  Except  for  this  restriction,  the  operation  of 
the stack is transparent to the user. Any inconsistency that arises from a stack operation causes
the FPASP to branch to a trap routine which halts execution and saves the state of the entire
machine. This routine uses the stack counters to perform the save. The programmer can use this
routine also, or write one of their own if they do not want execution to halt on a trap condition.

Figure 3.7. Stack Architecture.

3.7.2 Microcode Storage. The fixed portion of the microcode memory resides in the
XROM. This is a standard cell compiled by software in the AFIT CAD environment. The program
and cell designs are the result of AFIT thesis research [Ros85] and ongoing improvement by the
AFIT  faculty  [Lin88-2]. 

The  FPASP  design  methodology  is  to  fabricate  the  FPASP  in  quantity  and  laser  program 
it  for  each  specific  application.  This  approach  to  ASP  design  runs  counter  to  the  practice  of 
designing a new ASP for each application. The technology that makes this possible is the laser
programmable  ROM  or  LPROM.  The  laser  programming  technique  is  compatible  with  the  MOSIS 
design  rules  and  fabrication  processes.  If  the  designer  had  more  control  over  the  fabrication,  a 
different  programmable  store  could  be  used.  A  better  one  might  be  a  UVPROM  or  the  Flash 
memory  recently  announced  by  the  Intel  Corporation  [Hit88]. 

The  design  of  the  LPROM  is  the  topic  of  Capt.  Tillie’s  research,  which  was  carried  out  in 
parallel with this effort [Til88]. Capt. Tillie has written the tools needed to optimize the layout of
the  LPROM  bits  and  drive  the  hardware  which  does  the  actual  programming.  The  design  for  the 
LPROM  cell  was  available  at  the  time  the  FPASP  was  floorplanned.  The  LPROM  is  eight  times 
less  dense  than  the  XROM,  so  there  are  fewer  words  of  LPROM  available. 

The  FPASP  provides  640  words  of  XROM  and  144  words  of  LPROM.  Each  word  is  128  bits 
wide,  but  physically  they  are  split  into  two  64  bit  words.  This  split  solves  several  problems.  The 
first  is  that  an  XROM  128  bits  wide  would  be  too  slow  due  to  wide  wordlines.  The  second  is  that 
it  is  easier  to  fit  into  the  FPASP  floorplan  as  two  smaller  pieces.  The  third  is  that  the  GMAT 
microcode  assembly  tool  cannot  handle  integers  larger  than  128  bits  at  this  time. 

The  split  also  made  it  natural  to  put  control  bits  for  the  upper  datapath  on  the  upper  ROM 
and  likewise  for  the  lower  datapath  control  bits.  Bits  used  for  neither  datapath  were  split  up  to 
balance  out  the  number  of  bits  on  each  half  to  64.  In  the  original  floorplan  of  the  FPASP,  the 
ROMs  were  actually  arranged  facing  their  respective  datapaths,  reinforcing  the  notion  of  “upper” 
and  “lower.”  This  also  decreased  the  length  of  the  control  busses.  The  ROMs  are  no  longer  in  this 
orientation,  but  the  bits  have  been  rearranged  so  they  are  still  as  close  as  possible  to  their  final 
destination. 

Since there is less LPROM than XROM, careful consideration must be given to which
code  belongs  in  the  XROM  and  which  will  be  written  into  the  LPROM.  The  normal  approach 
would  be  to  write  a  routine  into  the  LPROM  which  calls  the  generic  routines  in  the  XROM. 

The  FPASP  must  be  able  to  do  Kalman  filtering  even  if  the  laser  programming  capability 
is  not  available;  therefore  the  FPASP  XROM  contains  all  of  the  code  needed  to  perform  that 
algorithm.  That  routine  has  been  written  to  use  the  matrix  algebra  routines  that  would  exist  in 
the  XROM  anyway,  so  the  overhead  of  the  more  generic  architecture  of  the  FPASP  is  decreased. 

3.7.3 Microinstruction Pipeline. The microcode control bits from the XROM or LPROM
are  loaded  into  the  pipeline  registers  and  driven  out  to  the  datapath  hardware  on  the  next  clock 
cycle.  This  allows  more  time  for  decoding  the  control  bits.  The  bits  go  out  on  the  rise  of  the  first 
clock  pulse  instead  of  after  the  precharge  period  and  access  delay  of  the  ROMs. 

This  also  isolates  the  control  lines  from  the  transients  coming  out  of  the  ROMs.  Since  the 
ROMs  are  precharged,  their  outputs  change  to  the  precharge  level  and  then  settle  to  their  correct 
value.  This  pulse  of  incorrect  control  would  cause  the  decoders  to  dissipate  power  needlessly.  It 
could  also  cause  a  change  of  state  that  could  not  be  recovered  from,  as  in  the  case  of  driving  a 
precharged  bus  incorrectly. 

The  pipeline  allows  the  control  bits  an  entire  clock  cycle  to  settle  before  it  latches  them  in. 
That  feature  becomes  crucial  when  the  control  bits  are  to  be  overridden  by  bits  from  the  datapath, 
which  occurs  in  the  case  of  the  assembly  language  execution.  The  new  control  bits  have  plenty 
of  time  to  arrive  and  stabilize.  This  way,  their  release  to  the  decoders  on  the  next  clock  cycle  is 
synchronized  just  like  ROM  control  bits. 

3.7.3.1 Pipeline Scanpath. The introduction of the pipeline registers into the control
path also makes it easy to add design-for-testing (dft) into the FPASP architecture. The
purpose  of  dft  is  to  decrease  the  amount  of  work  needed  to  test  the  chip.  The  additional  hardware 
gives  the  tester  more  controllability  and  observability  over  the  state  of  the  machine  [Fuj85]. 

In  the  FPASP,  the  dft  hardware  consists  of  a  set  of  muxes  which  allow  the  pipeline  registers 
to  be  chained  together  to  form  a  serial  scanpath.  This  path  allows  the  control  bus  bits  to  be  shifted 
out  to  one  pad  at  a  time,  providing  complete  observability  of  the  control  word.  Controllability  is 
also  provided  since  the  scanpath  can  also  shift  in  new  bits  at  the  same  time.  Thus,  the  tester  has 
more  control  over  the  state  of  the  entire  machine. 

The clock lines to the pipeline can be isolated from the clock lines to the rest of the chip. The
scanning is done while the clocks to the rest of the chip are disabled, so its state does not change
with the changing control word. Separate test input and output pins allow the scanned word to be
fed directly back in, so the machine can pick up from where it left off.
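
The scanpath behaves like a shift register, which can be sketched in Python. The model below uses a short list of bits in place of the full control word, and the function name and the direction of the chain are assumptions made for illustration.

```python
def scan(pipeline_bits, scan_in_bits=None):
    """Shift the chain once per bit.

    Returns the new register contents and the bits seen at the scan-out pad.
    When scan_in_bits is None, each bit shifted out is fed straight back in.
    """
    chain = list(pipeline_bits)
    scanned_out = []
    for i in range(len(chain)):
        bit_out = chain.pop()            # last register drives the output pad
        scanned_out.append(bit_out)
        bit_in = bit_out if scan_in_bits is None else scan_in_bits[i]
        chain.insert(0, bit_in)          # new bit enters the first register
    return chain, scanned_out
```

Scanning with feedback reads out every bit and leaves the registers exactly as they were, which is why the machine can resume afterwards. Scanning with a supplied bit stream replaces the control word instead.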


3.8  Assembly  Language  Support 

The  idea  of  having  flag  bits  from  the  upper  R1  register  was  intended  to  allow  the  user  to 
choose  between  separate  routines  written  into  the  LPROM.  The  limited  area  available  for  microcode 
ROMs  meant  that  the  user  written  code  was  limited  to  144  words. 

This  limit  was  approached  in  some  of  the  code  written  for  the  EE588  class  projects.  Even 
though  those  routines  were  not  fully  optimized,  the  fact  that  they  came  close  to  the  limit  indicated 
that  the  FPASP  should  allow  for  external  input  of  control  words,  not  just  flags. 

The FPASP architecture was therefore extended to support an assembly language. Since the specification
of the architecture was well along, the assembly language was designed to fit in with the
existing hardware and to use a minimum of new hardware. Following this philosophy, only existing
registers were used for the instruction register (IR) and program counter (PC). The upper R1
already had ties to the control section by the flags mentioned above, so it was used as the IR. The
lower R1 also became necessary as an IR.

The  way  additional  decoding  hardware  was  minimized  was  to  bring  in  identical  copies  of 
the  control  fields  the  ROMs  put  out.  The  original  idea  was  to  replace  all  the  bits  one-for-one,  so 
there  would  only  be  a  single  instruction  format.  That  was  obviously  impractical  since  it  would 
require  four  registers  and  two  memory  accesses.  It  would  also  be  wasteful  since  all  of  the  internal 
capabilities  are  not  needed  when  the  code  is  supplied  from  the  outside.  For  example,  the  literal 
inserter  is  not  needed  since  the  data  would  be  coming  in  from  the  memory  anyway. 

One  basic  control  to  override  is  the  register  select  fields,  so  the  assembly  instruction  can 
specify  the  source  and  destination  registers.  The  bus  ties  must  also  be  overridden  to  get  complete 
control of the entire bus architecture. These fields have a combined bit count of 33, larger than the
upper R1, so the lower R1 was added. The E bus tie is not included in the count, since both R1’s
are needed to hold replacement control bits. The most efficient way to fill both R1’s is to treat
the assembly language words like floating point numbers and store both halves in corresponding
memory locations.

The  A  pointer  is  used  as  the  PC,  so  its  increment  register  is  loaded  with  a  ‘1’.  The  microcode 
written  to  support  this  assembly  language  allows  three  addressing  modes.  The  default  mode  is  to 
use  the  incremented  value  of  the  A  pointer.  The  ‘immediate’  mode  takes  the  address  in  the  20 
LSBs of the upper R1. The ‘indirect’ mode uses the address in any one of the four pointers to read
in  the  address  of  the  final  address  to  be  used.  The  final  address  in  turn  is  put  into  the  MAR. 

The  control  pipeline  latency  does  not  affect  the  assembly  language  flow;  all  of  that  is  absorbed 
by  the  support  microcode.  The  flags  are  latched  from  the  previous  assembly  instruction,  and  so 
are available immediately to the assembly instruction “branch.”

There  are  six  instruction  formats.  Each  format  has  fields  for  the  replacement  control  bits 
needed  by  the  hardware  used  to  execute  that  instruction.  These  fields  are  arranged  to  overlap  as 
little as possible, so there are not too many muxes on any bit of the R1 registers.

When  the  replacement  control  bits  are  needed,  they  are  selected  by  bits  in  a  new  microcode 
field.  These  new  bits  control  muxes  which  select  the  IR  bits  over  the  bits  coming  out  of  the  ROMs. 

Since  the  actual  control  bits  are  replaced,  most  of  the  assembly  language  instructions  can 
perform  as  many  different  operations  with  the  FPASP  hardware  as  the  original  microcode  field. 
For example, the ALU/shifter instruction can do any combination of ALU and shifter operations.
On the other hand, the barrel shifter instruction cannot access the barrel control register since that
mode of operation is not feasible when several microinstructions are performed for each assembly
language  instruction. 

The  increased  number  of  clock  cycles  needed  for  the  assembly  language  instruction  makes  that 
method  of  programming  the  FPASP  inefficient  for  time  critical  operations  such  as  the  dot  product 
routine,  which  is  needed  N  cubed  times  in  a  matrix  multiply.  Rather,  the  assembly  language 
should  be  used  to  extend  the  sequential,  routine-calling  portion  of  the  application.  Thus,  the  heavy 
processing  is  done  by  the  more  efficient  microcode  routines  in  the  XROM  library  or  those  made  up 
by  the  user  in  the  LPROM. 

One  piece  of  additional  hardware  that  could  not  be  easily  avoided  was  a  decoder  to  select  the 
address  of  the  microcode  for  the  operation  the  opcode  represents.  Since  a  small  LPROM  already 
existed  in  the  form  of  the  function  ROM,  the  mapping  cell  was  made  an  LPROM.  The  decision 
to  do  the  MAP  with  an  LPROM  gives  this  assembly  language  an  additional  feature:  it  can  be 
partially  user-defined. 

The MAP is addressed by the five opcode bits in the lower R1. Five bits allow 32 opcodes.
Of the 32 possible opcodes, one is used for the TRAP routine, ten are used for the pre-defined
instructions, and 21 are left over for the user to program. The predefined macro instructions are
listed in Table 3.3 along with the operations they can perform.

Instruction  Operation
-----------  ---------
Load         load one or two registers from external memory
Store        store the contents of one or two registers to external memory
Branch       branch to an external memory location on any branch
             condition, using direct or indirect addressing
Call         call an external subroutine, using direct or indirect addressing
Return       return from an external subroutine
ALU          perform any ALU operation with one or both ALUs,
             with any source or destination registers
Bshift       perform a barrel shift with any source or destination registers
Ptr/Inc      increment a pointer or incrementable register,
             or load a pointer’s increment register
FP+          perform a floating point addition, with any source or
             destination registers
FP*          perform a floating point multiplication, with any
             source or destination registers

Table 3.3. Assembly Language Macro Instructions.

3.8.1 A User Defined Assembly Language. With laser programmability in the microcode
LPROM and the mapping LPROM, the user can make up the code needed to support an
assembly  instruction,  and  then  assign  the  address  of  that  code  to  an  unused  opcode  and  enter  it 
into  the  MAP.  The  control  bits  used  for  the  pre-defined  instructions  are  also  available  to  the  user 
and  they  cover  almost  all  of  the  hardware  in  the  FPASP.  The  user  can  override  those  fields  at  any 
time  in  their  own  routine,  but  is  restricted  to  the  six  choices  listed  in  the  microcode  field  definition 
in Appendix A1.

The most common use would probably just be to branch to a specific routine in the microcode
depending on the opcode entered, but instructions useful to a specific assembly language program
could also be created. For example, the support code for performing a multiply-and-accumulate
could  be  written  into  the  LPROM,  with  a  branch  at  the  end  to  the  FETCH  routine  used  by 
the  predefined  assembly  language  routines.  The  multiply-and-accumulate  would  then  be  a  new 
instruction  in  the  assembly  language.  The  control  bits  for  selecting  the  source  and  destination 
registers  for  the  new  instruction  would  be  selected  at  the  start  and  end,  just  like  the  predefined 
floating  point  assembly  language  instructions. 
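
The role of the MAP in a user-defined instruction can be sketched as a small write-once lookup table. Everything specific in this sketch is an assumption made for illustration: the opcode is taken from the five LSBs of the lower R1, the opcode numbers and microcode addresses are invented, and an unprogrammed entry raises an error rather than modeling whatever the hardware does.

```python
class MappingRom:
    """Write-once table from a 5 bit opcode to a microcode address."""

    SIZE = 32

    def __init__(self, predefined):
        self.table = [None] * self.SIZE
        for opcode, address in predefined.items():
            self.program(opcode, address)

    def program(self, opcode, address):
        """Laser-program one entry; an entry cannot be written twice."""
        if not 0 <= opcode < self.SIZE:
            raise ValueError("opcode out of range")
        if self.table[opcode] is not None:
            raise ValueError("opcode already programmed")
        self.table[opcode] = address

    def lookup(self, lower_r1):
        """Return the support-code address for the opcode held in lower R1."""
        opcode = lower_r1 & 0x1F      # opcode position in R1 is assumed
        address = self.table[opcode]
        if address is None:
            raise LookupError("unprogrammed opcode")
        return address

# Hypothetical addresses: two predefined entries, then a user-defined
# multiply-and-accumulate instruction assigned to an unused opcode.
rom = MappingRom({0: 0x010, 1: 0x020})
rom.program(11, 0x2A0)
```

After programming, fetching an instruction whose opcode field is 11 sends the microsequencer to the user's support code at the assigned address.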

3.9  External  Interfaces 

The FPASP depends on external circuitry to store data, and to tell it when to start processing.
Two different memory chips were researched for the FPASP. These were static RAM chips
with access times less than 35 nanoseconds. These criteria were based on the need to perform memory
accesses in one clock cycle, and the desire to eliminate refresh circuitry and timing problems
associated with dynamic RAMs.

The  external  memory  would  be  best  organized  with  each  memory  chip  providing  one  bit  of 
the data. This memory organization simplifies addressing since no chip select lines must be decoded
from  the  address  bits  put  out  by  the  FPASP.  If  such  a  decoder  were  required  in  the  future,  it  would 
be  best  located  on  the  FPASP  so  that  the  access  time  does  not  suffer  from  having  external  circuitry 
in  the  address  path. 

The researched memory chips only require two control signals, and they have a zero nanosecond
delay from when the data becomes valid to when the write enable signal can rise [Per88].
This  simplifies  the  timing  of  the  signals  from  the  FPASP.  The  three  signals  in  the  microcode  from 
the  original  chips  were  retained.  The  extra  bits  would  allow  the  FPASP  to  control  other  external 
circuitry  if  required. 

The FPASP also communicates with the host processor. For its first application, the FPASP
will be mounted on a board which is plugged into the VMEbus of a Sun workstation. The general
layout  of  this  board  is  shown  in  Figure  3.8.  In  this  case,  the  host  communicates  with  the  FPASP 
through  the  board’s  interface  chip. 

The  entire  operating  cycle  of  the  circuit  would  follow  the  timing  diagram  shown  in  Figure 
3.9.  The  host  fills  the  external  memory  with  data,  puts  its  address  and  data  lines  to  the  memory 
chips in a high impedance state, then raises the go line. When the FPASP finishes, it switches its
bus  drivers  to  high  impedance  and  raises  the  done  line.  This  signals  the  host  that  it  can  unload 
the  results  and  start  over  again. 
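
The handshake described above can be sketched as a small behavioral model. This is only an illustrative sketch, not the interface chip's actual logic: the class and method names are hypothetical, bus ownership stands in for the high impedance states, and the point at which the host lowers go is an assumption, since the text does not specify it.

```python
from enum import Enum


class Owner(Enum):
    """Which side currently drives the external memory busses."""
    HOST = "host"
    FPASP = "fpasp"


class SharedMemoryBoard:
    """Toy model of the external memory shared by the host and the FPASP."""

    def __init__(self, size):
        self.mem = [0] * size
        self.owner = Owner.HOST  # the other side is assumed to be in high impedance
        self.go = False
        self.done = False

    def host_load_and_start(self, data):
        # The host fills memory, releases the busses, then raises go.
        assert self.owner is Owner.HOST
        self.mem[:len(data)] = data
        self.done = False
        self.owner = Owner.FPASP
        self.go = True

    def fpasp_run(self, kernel):
        # The FPASP processes the data, releases the busses, then raises done.
        assert self.go and self.owner is Owner.FPASP
        kernel(self.mem)
        self.owner = Owner.HOST
        self.done = True

    def host_unload(self, count):
        # Assumption: the host lowers go once it has seen done.
        assert self.done and self.owner is Owner.HOST
        self.go = False
        return self.mem[:count]
```

The assertions in each method express the rule that only one side may drive the memory busses at a time.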

The FPASP also has an external interface in the form of two pins which run directly to the
condition mux in the microsequencer. These allow the external circuitry to control the flow of the
microcode. These flags are called IO1 and IO2. They could be used for interrupt signals if the
FPASP controls external devices.

Figure 3.8. Circuit Board for VMEbus.

Figure 3.9. Host-FPASP Handshaking.

IV. VLSI Implementation


4.1  Introduction 

This chapter presents the hardware designs that will embody the FPASP architecture. This
corresponds roughly to the VLSI portion of the design hierarchy shown in Figure 4.1. The
implementation will be in CMOS, so that influence cannot be completely left out of the picture.

Figure 4.1. Design Areas Covered in this Chapter.

The  discussion  will  be  mostly  at  the  block  diagram  level  of  detail,  though  occasionally  the 
details of the CMOS design will be necessary to fully explain why a particular design was chosen.
Microcode examples will be used to show how the hardware has been designed to support software
structures. 

4.2 Floorplanning


The  hardware  design  began  with  a  survey  of  the  cell  libraries  available,  especially  the  ASP 
library  created  by  Capt.  Gallagher.  Even  though  most  of  these  cells  could  not  be  used  in  the 
FPASP  directly,  they  did  provide  a  good  size  estimate  for  the  FPASP  floorplan. 

After  the  first  register  level  description  of  the  FPASP  was  done,  outlines  of  the  hardware  it 
called  for  were  laid  out  in  magic  as  large  squares  of  different  layers.  Different  layers  were  used  only 
for  their  colors,  so  smaller  macrocells  would  be  easy  to  see.  Contact  layers  were  not  used  since  they 
do  not  plot  as  blocks;  they  plot  as  a  mass  of  small  dots. 

These  blocks  of  color  were  then  arranged  in  as  square  a  fashion  as  possible,  while  maintaining 
what was felt to be a good organization. “Good” organization was one that limited the length of
the  various  busses,  especially  the  control  busses.  A  drawing  of  this  original  floorplan  is  shown  in 
Figure  4.2. 

Figure 4.2. Original FPASP Floorplan


The  estimates  of  the  datapath  hardware  were  accurate,  but  only  very  rough  estimates  of  the 
floating point hardware were available. At first, the adder was given a large area and put on the
opposite side of the floorplan from the multiplier. This turned out to be impossible to build because
the same physical channel was used for the B bus of the data registers and the E bus of the address
registers  to  create  the  overlapped  register  sets.  This  isolated  the  adder  from  the  B  bus. 

When better estimates of the sizes of the floating point multiplier and adder were available,
the  floorplan  was  rearranged,  with  the  result  that  the  adder  and  multiplier  were  placed  side-by-side. 
The  final  floorplan  is  shown  in  Figure  4.3.  The  rearrangement  of  cells  had  a  great  effect  on  the 
architecture  as  was  seen  in  Chapter  3,  with  the  addition  of  the  B  and  C  bus  ties.  What  made  the 
extra  bus  ties  possible  was  that  the  busses  all  came  together  between  the  floating  point  macrocells. 
The  extra  ties  no  longer  required  additional  channel  space  since  the  extra  channel  was  already  paid 
for  to  get  all  of  the  busses  to  the  adder. 

Better  estimates  also  became  available  for  the  microcode  ROMs.  The  new  arrangement 
allowed  the  ROMs  to  be  rotated  and  made  deeper  since  they  were  no  longer  affecting  a  critical 
dimension  of  the  chip.  Another  result  of  the  new  floorplan  was  that  the  control  section  was  put 
next  to  its  I/O  pads,  rather  than  having  signals  go  around  or  through  the  adder.  This  can  be  seen 
in  the  pad  arrangement  indicated  in  Figure  4.3. 

4.3 Timing

The FPASP is designed for a clock cycle of 40 nsec, corresponding to an operating frequency
of 25 MHz. The clock cycle is shown in Figure 4.4. The clock signals are non-overlapping to prevent
data from racing through the master-slave flip-flops. The Φ1 clock pulse is shorter than the Φ2
clock pulse because it is also used as the precharge signal. The Φ2 clock pulse width is less critical;
it only needs to be non-overlapping and long enough for the registers to latch their inputs. The actual
pulse width used for Φ2 can be whatever meets these criteria and is easiest for the clock generator
to produce.
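
The timing constraints in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the 40 nsec cycle comes from the text, but the pulse edges are hypothetical values chosen to satisfy the stated criteria, not figures taken from Figure 4.4.

```python
CYCLE_NS = 40.0  # clock cycle given in the text


def frequency_mhz(cycle_ns):
    """Operating frequency in MHz for a cycle time in nanoseconds."""
    return 1000.0 / cycle_ns


def non_overlapping(pulse_a, pulse_b):
    """True if two (rise_ns, fall_ns) pulses are never high at the same time."""
    return pulse_a[1] <= pulse_b[0] or pulse_b[1] <= pulse_a[0]


# Hypothetical pulse placement within one cycle (assumed, not from the thesis).
# The first clock phase doubles as precharge and is the shorter of the two.
PHI1 = (0.0, 12.0)
PHI2 = (16.0, 36.0)
```

With these assumed edges, `frequency_mhz(CYCLE_NS)` gives 25.0 and `non_overlapping(PHI1, PHI2)` holds, leaving a gap after each pulse.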

The FPASP has two pins for the precharge signal even though the design calls for precharge
and Φ1 to be the same. This lowers the load on the pins used for logic, and allows precharge to
be separated from Φ1 for testing. The clock lines to the pipeline registers also have separate pins so
that they can be isolated from the rest of the circuit for operating the serial scanpath for testing.

Figure 4.4. FPASP Clock Cycle.

4.4 Busses and Ties

The  FPASP  has  a  large  set  of  busses  for  the  various  operations  it  performs.  The  arrangement 
of  the  busses  inside  the  various  cells  is  shown  in  Figure  4.5.  This  bit-slice  arrangement  was  the 
main  design  feature  carried  over  from  the  ASP  cells.  The  arrangement  of  the  various  busses  and 
the  possible  ways  to  interconnect  them  are  shown  in  Figure  4.6. 

The connection of the upper and lower halves of each bus is done with the bus ties.
Interconnection between the A or B busses and the C busses can take place through the ALU/Shifters
using the Movn.pass command. Data can be swapped between upper and lower registers in a
single clock cycle using these commands and the bus ties, an unplanned but useful result of adding
the extra bus ties.

The  bus  ties  are  large  pass  transistors.  The  control  for  each  tie  is  a  single  bit  in  the  microword 
that  drives  the  transistors’  gates.  For  the  precharged  busses,  a  single  N  device  suffices,  while  the 
driven  busses  require  a  full  T-gate. 

4.4.1 Precharged Busses. The FPASP uses precharged A, B and E busses. This design
was chosen for several reasons, the first being that the ASP cells were designed for these busses.
Precharged busses offer some advantages over driven busses. The most important one is speed;
there are over 30 drivers on these lines, each with some drain capacitance. Eliminating the larger P
devices from the bus removes more than half the drain capacitance. The result is less charge that
has to be moved on and off the bus to change its state. This speeds up the transition and decreases
the rise and fall times. Shorter rise and fall times allow the inverters gated by the bus to dissipate
less current since they pass through their switchover point faster.

Figure 4.6. Bus Interconnections.

Removing  the  P  transistors  also  makes  the  register  cells  much  smaller  since  the  P  transistor 
is usually the larger of the two drivers. Transistor sizing is discussed in more
detail in Chapter 5. With the P device out of the cell, the decoder no longer has to supply the
inverse  of  the  drive  signal. 

4.4.2 Driven Busses. The C busses are driven because the macrocells that drive them
have asynchronous settling times, and the bits driven out may switch as the circuit settles. Most of
the load on the C busses is the drains of small T-gate muxes. The drivers for these busses do not
need to be as large as the ones on the A and B busses since the total drain capacitance is small.

The  busses  from  the  MBRs  to  the  pads  are  called  the  D  busses.  These  busses  are  also 
accessible to the R1 and R2 registers for reading in data. They are driven since there is also little
drain  capacitance  on  these  busses,  making  them  easy  to  drive  quickly.  The  same  is  true  of  the 
Address  busses,  which  go  only  from  the  MARs  to  address  pads. 

4.4.3 Bus Select Decoders. The signals that control access to the A, B, and C busses come
from a common set of decoders located below the registers. Each control field is five bits wide,
allowing for 32 choices. The choices for each field are listed in the microcode field definitions
in Appendix A.

In all cases the default code of 00000 is used for choosing no macrocell to load from or drive
onto  a  bus.  This  allows  cells  not  controlled  by  these  fields  to  have  access  to  the  busses.  The 
microcode  must  not  cause  a  condition  where  two  macrocells  are  driving  a  bus  at  the  same  time. 
An  exception  to  this  rule  involves  the  Literal  Inserter,  which  can  be  used  to  mask  data  on  the  A 
bus  by  driving  the  bus  at  the  same  time  as  a  register. 

Decoding circuitry to prevent this control problem would be complex and take up extra area
for  logic  and  wiring.  It  would  also  slow  down  the  control  signals,  possibly  preventing  them  from 
settling  fast  enough.  Therefore,  deconflicting  of  the  microcode  is  left  up  to  the  programmer. 
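
Since deconflicting is left to the programmer, a check of this kind could be run in software before the microcode is committed to the chip. The following is a minimal sketch under stated assumptions: the macrocell names (including `LIT` for the Literal Inserter), the `(bus, macrocell)` representation of a microword, and the function names are all hypothetical and do not reflect the actual FPASP microcode tools.

```python
from collections import defaultdict

NO_SELECT = 0b00000          # default code: the field selects no macrocell
LITERAL_INSERTER = "LIT"     # hypothetical name for the Literal Inserter


def decode_select(code, table):
    """Map a five bit select field to a macrocell name, or None for the default."""
    if not 0 <= code < 32:
        raise ValueError("select field is five bits wide")
    return None if code == NO_SELECT else table[code]


def bus_conflicts(drivers):
    """Return {bus: [macrocells]} for every bus with more than one driver.

    `drivers` is an iterable of (bus, macrocell) pairs active in one microword.
    The Literal Inserter is allowed to share the A bus with one register,
    since that combination is used deliberately to mask data.
    """
    by_bus = defaultdict(list)
    for bus, cell in drivers:
        by_bus[bus].append(cell)

    conflicts = {}
    for bus, cells in by_bus.items():
        effective = [c for c in cells
                     if not (bus == "A" and c == LITERAL_INSERTER)]
        if len(effective) > 1:
            conflicts[bus] = cells
    return conflicts
```

For example, a register and the Literal Inserter together on the A bus produce no conflict, while two macrocells on the C bus are reported.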

The decoding circuitry that controls connections with a precharged bus must contain a gate
for enabling the output only when the precharge cycle is over. This prevents the selected register’s
pulldown device from fighting with the pullup and dissipating power as heat. Also, since the bus
has no way to recover its charge, this gate is always the last combinational gate before the stage-up
and driver inverters. This way, a straggling control bit will not change the state of the control line
and cause a bus to be discharged in error.

In  addition  to  the  three  general  bus  select  fields,  the  special  purpose  registers  have  their  own 
control  fields.  The  MAR  control  field  choices  nmaru  and  nmarl  take  the  FPASP  off  of  the 
external  address  busses  by  putting  the  address  and  data  pads  in  a  high  impedance  output  state. 
This  is  the  choice  used  when  the  FPASP  is  done  processing  so  the  host  can  take  over  the  external 
busses. 

The MBR control field controls the registers which load from the D bus: the MBR, R1, and
R2. This makes possible another illegal microcode combination: loading R1 or R2 from the C and
D busses at the same time. R1 and R2 are also treated as general purpose registers and so they
have the same decoders as those registers, in addition to the D bus decoder. The MBR drives the
D bus only when the Write Enable bit for that datapath is low, which causes a memory write.

4.4.4 Microcode Examples. This section will present portions of FPASP microcode that
demonstrate how the bus design makes efficient routines possible. Figure 4.7 shows three examples.
The  first  example  shows  the  lower  ALU  adding  inputs  from  both  the  upper  and  lower  datapaths, 
and  returning  the  sum  to  both  datapaths. 

The second example performs a set of data moves using the A, B, C and D busses. Data is
written out of the MBRs and also saved over in the R1s. Meanwhile, the upper R22 is cleared,
and its previous contents are passed to the lower R22 through the tied A bus. Note that the E
busses cannot be used for data, since they go only to the MARs.

The  third  example  shows  a  line  of  code  which  uses  every  bus  on  the  FPASP.  The  A  and  B 
busses  are  loading  data  into  the  floating  point  adder,  while  the  tied  C  busses  are  being  grounded 
by  the  upper  Shifter  to  reset  two  of  the  incrementable  registers.  The  E  busses  are  tied  together  so 
the  A  pointer  can  be  loaded  into  both  MARs.  Meanwhile,  the  MARs  and  MBRs  are  writing  out 
the result of a previous operation, using the Address and D busses. At the same time, the ALUs
are  setting  their  carry  flags  for  subsequent  operations  or  branches.  While  all  that  is  going  on,  the 
A  and  C  pointers  are  being  incremented,  as  well  as  two  incrementable  registers. 

Figure 4.7. Microcode Examples of Bus Usage.

4.5 Register Arrays


The  registers  available  to  each  datapath  are  shown  in  Figure  4.8.  The  decoders  face  the  center 
of  the  chip,  towards  the  control  section.  The  split  between  the  B  and  E  busses  is  at  the  edge  of  the 
pointers.  To  the  left  of  the  split  is  all  of  the  processing  hardware,  and  to  the  right  are  the  pointers 
and  address  registers. 


Figure  4.8.  FPASP  Register  Set. 

All of the registers are master-slave flip-flops. The input is latched on the fall of Φ2. The
outputs are driven on the rise of Φ1 for the driven busses, and the fall of precharge for the precharged
busses. This type of register can be loading in a new value at the same time it is driving out the
old one.
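
The ability to load and drive in the same cycle is what allows two registers to exchange values in one clock. The sketch below is a simplified behavioral model, not the actual cell: the class and method names are hypothetical, and the two clock phases are reduced to a `capture` step and a `tick` step.

```python
class MasterSlaveRegister:
    """Behavioral model of a master-slave register (simplified, assumed names)."""

    def __init__(self, value=0):
        self.master = value
        self.slave = value

    def drive(self):
        # The bus always sees the slave half, i.e. the old value.
        return self.slave

    def capture(self, value):
        # The master half latches the new input without disturbing the output.
        self.master = value

    def tick(self):
        # At the cycle boundary the slave half takes over the master's value.
        self.slave = self.master


def swap_in_one_cycle(upper, lower):
    """Exchange two registers in a single modeled clock cycle."""
    old_upper, old_lower = upper.drive(), lower.drive()
    upper.capture(old_lower)
    lower.capture(old_upper)
    upper.tick()
    lower.tick()
```

Because each register keeps driving its old value while its master half captures the new one, no temporary register is needed for the exchange.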

4.5.1 Additional Increment Hardware. The six incrementable registers (three on each
datapath)  are  provided  to  support  looping  in  the  microcode.  This  is  done  by  having  the  incremented 
version  of  their  present  value  available  for  loading  on  the  next  clock  cycle.  They  also  produce  a 
flag  to  indicate  that  they  have  loaded  a  zero. 

These registers produce the zero flag whether the zeroes were loaded from the C bus or from
the incrementer, so they can be used to generate a zero branch condition if the ALU is
busy doing something else. They can also be used as general purpose registers if needed.

The  incrementer  logic  is  derived  from  the  logic  for  a  full  adder  whose  B  input  is  always  zero: 

\[ \mathrm{Sum} = A \oplus 0 \oplus C_{in} = A \oplus C_{in} \]
\[ \mathrm{Carry\ out} = A \cdot 0 + A \cdot C_{in} + 0 \cdot C_{in} = A \cdot C_{in} \]

For  the  first  stage,  Cin  (carry  in)  is  always  1,  so  the  logic  further  reduces  to: 

\[ \mathrm{Sum} = A \oplus 1 = \overline{A} \]
\[ \mathrm{Carry\ out} = A \cdot 1 = A \]

The  incrementable  registers  use  a  ripple  half-adder  to  generate  the  increment  of  the  register 
value.  The  ‘A’  input  to  this  incrementer  comes  directly  from  the  register,  so  it  starts  to  change  as 
soon  as  the  slave  half  of  the  register  loads.  This  allows  nearly  the  entire  clock  cycle  to  propagate 
the  carry  through  the  adder. 
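
The reduced half-adder equations can be exercised bit by bit in software. This is a minimal sketch of a ripple incrementer built from those equations; the function names are hypothetical, and the 32 bit width is assumed from the width of the integer datapaths.

```python
WIDTH = 32  # assumed register width, matching the 32 bit integer datapaths


def ripple_increment(value, width=WIDTH):
    """Increment `value` with a chain of half adders; returns (result, carry_out)."""
    carry = 1  # the carry into the first stage is always 1
    result = 0
    for i in range(width):
        a = (value >> i) & 1
        result |= (a ^ carry) << i   # Sum = A xor Cin
        carry = a & carry            # Carry out = A and Cin
    return result, carry


def zero_flag(value, width=WIDTH):
    """Flag raised when the register holds all zeroes."""
    return (value & ((1 << width) - 1)) == 0
```

Loading a negative count in 2's complement and incrementing it once per pass raises the zero flag after the expected number of passes, which is how these registers serve as loop counters.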

The data in the incrementable registers can be incremented every clock cycle, but the
incremented value of newly loaded data is not available on the next clock cycle. It takes one cycle to
get  the  value  through  the  incrementer  so  it  is  not  available  until  the  second  cycle  after  it  has  been 
loaded. 

4.5.2 Pointer Registers. The Pointer registers actually consist of three parts: the register
which  holds  the  pointer  value,  a  register  to  hold  the  increment  amount,  and  a  full  adder.  Since 
the  full  adder  turned  out  to  be  too  large  to  make  one  for  each  pointer,  the  two  pointers  on  each 
datapath  share  an  adder.  The  arrangement  of  the  two  pointer  registers  and  the  two  increment 
registers  with  the  carry-select  adder  is  shown  in  Figure  4.9.  The  choice  of  which  pair  to  add  is 
made  by  2:1  muxes  on  the  inputs  to  the  adder  and  inputs  to  the  pointer  registers.  The  control 
fields  for  these  registers  are  listed  in  Appendix  A. 

The  pointer  registers  are  the  same  as  the  incrementable  registers  described  above:  they  are 
general  purpose  registers  with  an  extra  input  to  load  the  adder  result.  The  increment  registers  are 
the  same  as  the  general  purpose  registers  but  without  pulldown  transistors  for  the  A  and  B  busses, 
since they drive only to the carry-select adder. The adder used to increment the pointers is derived
from  the  carry-select  adder  used  in  the  ALUs. 

Figure 4.9. Pointer Registers and Adder.

The  increment  registers  are  32  bits  wide,  so  they  can  hold  data  if  the  pointers  are  being  used 
to manipulate data rather than addresses. The increment registers are isolated from the busses
except  for  being  able  to  load  from  the  C  bus.  Driving  the  increment  registers  onto  the  data  busses 
would  have  required  a  selection  choice  out  of  the  A  or  B  register  select  fields,  decreasing  the  number 
of  selections  available  for  general  purpose  registers. 

4.5.2.1 Microcode Example. The microcode in Figure 4.10 shows how the
two types of incrementable registers support looping, and addressing into a matrix. The routine
simply  puts  whatever  value  is  in  the  MBRs  into  every  element  of  the  vector  whose  starting  address 
is  in  the  A  pointer  (APT).  The  negative  value  of  the  number  of  elements  in  the  vector  is  passed  in 
the  second  upper  incrementable  register  (UIN2).  The  distance  between  the  elements  in  the  external 
memory  is  in  the  increment  holding  register  of  the  A  pointer. 
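
The intent of this routine can be expressed as a short functional model. The sketch below is not the microcode of Figure 4.10 and is not cycle accurate: it ignores the control pipeline latency entirely, and the function and parameter names are hypothetical stand-ins for the registers named in the text.

```python
def fill_vector(memory, mbr_value, apt, stride, neg_count):
    """Write `mbr_value` to every element of a strided vector.

    apt       -- starting address, as held in the A pointer (APT)
    stride    -- distance between elements, as held in the increment register
    neg_count -- negative of the element count, as passed in UIN2
    Returns the final pointer value.
    """
    uin2 = neg_count
    while uin2 != 0:             # the zero flag ends the loop
        memory[apt] = mbr_value  # MAR takes the pointer, the MBR is written out
        apt += stride            # the pointer adder bumps APT by the stride
        uin2 += 1                # the loop counter is incremented towards zero
    return apt
```

Passing the count as a negative number lets the loop terminate on the zero flag of the incrementable register, so no comparison hardware is needed.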

The  first  line  loads  both  MARs  from  the  A  pointer  using  the  E  bus  tie.  At  the  same  time  the 
A  pointer  is  driving  the  E  bus,  it  is  also  loading  its  incremented  value.  The  loop  counter  is  checked 
to  see  if  it  has  been  incremented  to  zero.  If  it  has,  then  the  calling  routine  is  returned  to.  With 
the  latency  in  the  control  section,  the  line  following  the  return  would  be  done  anyway,  which  would 
write  the  MBRs  out  to  the  address  in  the  MARs.  If  the  loop  were  not  done,  then  the  condition  on 
the  next  line  would  be  true,  and  the  program  would  branch  to  the  top  of  the  loop. 

Again, the latency means that the third line would be executed also. So the loop counter
would be incremented towards zero. At the top of the loop, the address of the next element would be
put into the MARs, the pointer bumped up to the following element, and the loop would iterate
again.

Figure 4.10. Example of Incrementable Register Usage.

When  the  loop  counter  reaches  zero,  the  second  line  is  still  done  due  to  the  latency.  In  this 
case,  the  condition  on  the  second  line  would  be  false.  This  is  important,  because  if  it  were  true, 
the  loop  address  would  be  loaded  into  the  CAR  instead  of  the  incremented  return  address,  and 
the  program  would  branch  back  here.  The  default  in  every  case  where  the  condition  is  false  is 
the  present  address  plus  one.  So  even  though  the  second  line  of  the  loop  is  being  performed,  the 
“present  address”  further  down  the  control  pipeline  is  the  return  address  of  the  calling  routine, 
which  was  popped  off  the  stack  by  the  ret  instruction.  Now  the  ‘next  address’  due  to  the  false 
condition  is,  in  this  case,  the  next  address  after  the  return  address,  so  the  program  flows  properly. 

Notice  that  the  loop  counter  had  to  be  incremented  at  least  one  line  before  it  could  be  checked. 
This is because the flags are all delayed one clock cycle before arriving at the condition mux in the
microsequencer: the value in the register is not checked until the beginning of the
next clock cycle, when it has reached the output of the slave half of the register.

4.5.3 Instruction Registers. In order to support the assembly language, the FPASP
requires an instruction register to hold the new control bits. Since the instructions must come into
the FPASP through R1, R2 or the MBR, it made sense to modify one of these registers to act as
an instruction register.

The upper and lower R1s were chosen for this modification since they are next to the open
column needed for the data bus. This allows the bus carrying the instruction bits from the R1s to
the control section to use the space beneath the data bus. In the case of the upper R1, there are
already eight flag lines running out of the register that can double as instruction lines. This design
made it easy to fit the assembly language support hardware into the existing design.

The control bits go from the R1s to muxes on the pipeline registers, where they are chosen over
the bits coming out of the microcode ROMs if needed. A special 3 bit field in the microcode word
allows eight different sets of muxes to be selected, depending on the instruction being performed.
The commands in this field are only used by microcode which supports the assembly language.

4.5.4 Assembly Language Example. This section shows the microcode routine used
to support one of the assembly language instructions. This is the instruction which does ALU
functions. The format for the instruction is shown at the top of Figure 4.11. The upper and
lower R1s hold the instruction, and the A pointer holds the address of the next instruction. The
microcode that supports this instruction is very simple. When called, it uses the ALUSEL choice in
the 3 bit field which controls the muxes on the pipeline. It overrides the control bits from the ROM
with the control bits in the R1 registers. Then it loads the next instruction into the R1 registers
and unconditionally branches to the FETCH routine.

The  FETCH  routine  uses  the  Mapping  ROM  to  select  the  address  of  the  support  routine  for 
that  instruction.  The  A  pointer  is  incremented  to  the  following  instruction.  The  NOP  in  the  second 
line  covers  the  control  pipeline  latency. 

The  ALUSEL  control  selection  enables  the  muxes  which  override  the  bus  select  fields,  the 
bus tie fields, and the ALU/Shifter fields. These fields are fully replaced by the bits from the R1
registers,  giving  the  assembly  language  instruction  full  choice  of  all  the  possible  operations  that 
can  be  done  in  a  single  microcode  cycle. 
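
The override performed by the pipeline muxes amounts to a bitwise selection between two sources. The sketch below illustrates the idea only: the 16 bit word, the field positions in `ALUSEL_MASK`, and the names are hypothetical, since the real microword layout is defined in Appendix A.

```python
# Hypothetical 16 bit control word; the real microword is wider.
# Assumed positions of the bus select, bus tie, and ALU/Shifter fields.
ALUSEL_MASK = 0x0FF0

# The 3 bit field picks one of up to eight mux sets; two are shown here.
MUX_SETS = {
    "NONE": 0x0000,      # every bit comes from the microcode ROM
    "ALUSEL": ALUSEL_MASK,
}


def pipeline_word(rom_word, r1_word, mux_choice):
    """Model the 2:1 muxes in front of the pipeline register.

    Where the selected mask has a 1, the bit is taken from the R1 registers;
    elsewhere it is taken from the microcode ROM.
    """
    mask = MUX_SETS[mux_choice]
    return (rom_word & ~mask) | (r1_word & mask)
```

Fields outside the mask, such as the branch control, still come from the ROM, which is how the support routine keeps control of program flow while the instruction chooses the datapath operation.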

The  nop  is  needed  at  the  end  of  the  routine  to  take  up  the  slack  caused  by  the  pipeline 
latency;  the  next  microinstruction  will  be  done  regardless  of  the  branch  in  the  first  instruction. 
The FETCH routine then reads the next instruction into the R1s, and increments the A pointer to
the  next  instruction. 

selects the controls for the ALU/Shifters,
selects the A, B, C busses and ties


ALU  OP;  TRU(BR)  ALUSEL 

FETCH  ; 


FETCH:  APT+  MAR=EU  ETIE  E=APT  MAP 


Figure  4.11.  Example  of  Assembly  Language  Support  Code. 


4.6 Processing Components

This  section  shows  block  diagrams  of  the  various  macrocells  which  perform  the  data  processing 
operations  on  the  FPASP.  The  floating  point  adder  has  yet  to  be  fully  designed,  but  the  type  of 
cells it will use is known. For all of the macrocells, the decoding logic was simplified as much as
possible by arranging the operations in an order that would decrease the number of boolean terms
in  the  control  equation.  The  control  options  are  listed  in  the  microcode  definition  in  Appendix  A. 
Some  of  these  lists  will  be  presented  in  more  detail  below. 

4.6.1  ALU/Shifter.  The  ALUs  perform  simple  arithmetic  or  logic  functions  on  each  of 
the  32  bit  integer  data  paths.  Incorporated  into  each  ALU  cell  is  a  linear  shifter  which  can  perform 
a  left,  right  or  no  shift  on  the  data,  with  a  variety  of  sources  for  the  bit  to  be  shifted  in.  The 
ALU  and  shifter  produce  five  flags  for  conditional  branching.  These  flags  are  held  in  master-slave 
flip-flops  so  they  can  be  used  until  reset. 

The  diagram  of  Figure  4.12  shows  the  ALU/Shifter  and  the  flags  generated  by  each.  Some  of 
the  control  signals  needed  for  this  cell  are  also  shown.  These  control  signals  are  all  generated  from 
the  four  control  bits  supplied  to  each  of  the  two  sections. 

4.6.1.1 ALU Design. The ALU for the FPASP is a highly modified version of the
one  used  in  the  ASP.  The  new  circuitry  designed  for  the  FPASP  is  an  input  mux  to  allow  selection 
of  A  or  Abar,  and  the  linear  shifter  hardware.  Also,  the  “pass”  selection  was  removed  in  favor  of 
an  OR  with  zero  to  simplify  the  decoding  hardware.  The  pulldown  transistor  for  the  distributed 
NOR  gate  which  produces  the  zero  flag  was  also  incorporated  in  the  ALUs. 

The  operations  which  can  be  performed  by  the  FPASP  ALU  are  listed  in  Table  4.1,  along 
with  the  boolean  or  arithmetic  function  chosen  to  implement  each  one.  The  negative  logic  functions 
were  implemented  with  the  existing  ASP  logic  gates  by  using  the  negative  inputs  and  invoking 
DeMorgan’s theorem:

\[ \overline{A \cdot B} = \overline{A} + \overline{B} \]

\[ \overline{A + B} = \overline{A} \cdot \overline{B} \]

Thus,  the  negative  logic  functions  have  been  added  without  adding  more  gates,  just  the  Abar 
choice  on  the  input. 
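
The equivalence relied on here is easy to confirm on 32 bit words. The following sketch is only a software check of DeMorgan's theorem, with hypothetical function names; it does not describe the gate-level structure of the ALU.

```python
MASK32 = 0xFFFFFFFF  # 32 bit datapath width


def nand_via_or(a, b):
    """NAND built from an OR gate fed with inverted inputs."""
    return (~a | ~b) & MASK32


def nor_via_and(a, b):
    """NOR built from an AND gate fed with inverted inputs."""
    return (~a & ~b) & MASK32
```

Both functions give the same result as inverting the output of the positive gate, so the existing OR and AND gates serve for NAND and NOR once inverted inputs are available.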

The  arithmetic  functions  are  performed  in  2’s  complement  fashion  using  a  carry-select  adder. 
This type of adder circuit was required for speed. The tradeoff is that it uses twice as much
hardware  as  a  ripple-carry  adder. 


Function   Operation                     Flags affected
                                         (CARRY, OVERFLOW, SIGN, ZERO)

MOVN       A                             none
OR         A + B                         zero
AND        A · B                         zero
XOR        A ⊕ B                         zero
MOV        A                             zero
NAND       Abar + Bbar                   zero
NOR        Abar · Bbar                   zero
NOT        Abar                          zero
INC        A + 0 + 1                     All Four
SET        A + B + 1                     All Four, Sets carry
ADC        A + B + previous carry        All Four
ADD        A + B + 0                     All Four
NEGA       Abar + 0 + 1                  All Four
SUB        A + Bbar + 1                  All Four
SWB        A + Bbar + previous borrow    All Four
DEC        A + (-1) + 0                  All Four

Table 4.1. ALU Operations.


A  carry-select  adder  works  by  computing  the  sum  for  both  possible  carry  ins.  When  the 
actual  carry  in  becomes  known,  the  proper  sums  from  that  stage  are  selected.  The  previous  carry 
also  selects  the  corresponding  proper  carry  out  of  that  stage,  which  is  then  used  to  select  the  sums 
of  the  next  stage  in  the  adder.  The  stages  used  for  these  carry  select  muxes  are  shown  in  Figure 
4.13. 

The  number  of  ripple  delays  can  increase  in  the  lower  stages  because  the  proper  carry  takes 
longer  to  get  to  them.  The  only  carry  ripple  delay  which  is  on  the  critical  path  is  the  one  in  the 
first  stage,  which  is  only  four  adders  deep.  For  this  first  stage,  the  carry  in  of  the  “previous”  stage 
is  chosen  by  the  operation  being  performed. 

There are four choices for this carry in bit: the carry of the previous adder operation, its
inverse, a ‘1’, or a ‘0’. The choice required for each operation is shown in Table 4.1.
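
The selection scheme can be illustrated with a short behavioral model. This is a sketch under stated assumptions rather than the FPASP circuit: only the four bit first stage comes from the text, the remaining stage widths are hypothetical values chosen to total 32 bits, and every stage is modeled the same way for simplicity.

```python
def ripple_add(a, b, carry_in, width):
    """Add two `width` bit fields; returns (sum, carry_out)."""
    mask = (1 << width) - 1
    total = (a & mask) + (b & mask) + carry_in
    return total & mask, total >> width


def carry_select_add(a, b, carry_in, stage_widths=(4, 4, 5, 6, 6, 7)):
    """32 bit carry-select addition; returns (sum, carry_out).

    Each stage computes its sum for both possible carry ins, and the
    carry arriving from the previous stage selects one of the two results.
    """
    result, shift, carry = 0, 0, carry_in
    for width in stage_widths:
        mask = (1 << width) - 1
        a_field = (a >> shift) & mask
        b_field = (b >> shift) & mask
        sum0, carry0 = ripple_add(a_field, b_field, 0, width)
        sum1, carry1 = ripple_add(a_field, b_field, 1, width)
        stage_sum, carry = (sum1, carry1) if carry else (sum0, carry0)
        result |= stage_sum << shift
        shift += width
    return result, carry
```

A subtraction is then an addition of the inverted B operand with a carry in of 1, for example `carry_select_add(7, ~5 & 0xFFFFFFFF, 1)`.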


4.6.1.2 Shifter Design. The ALU output feeds directly into the Shifter. The Shifter
can  perform  left  or  right  shifts  with  eight  choices  of  input  sources.  It  can  also  just  pass  the  ALU 
output  to  the  C  bus,  or  ground  all  of  the  C  bus  lines.  These  functions  are  listed  in  Table  4.2,  along 
with  the  shifted  in  bit  needed  for  each  one.  The  bit  shifted  out  is  available  as  a  flag  on  the  next 
clock  cycle. 

Function      Operation       Shifted-in bit       Shift-out saved

NOP           No Drive        none (LSB)           no
GNDC          Ground C bus    none (carry flag)    no
PASS          No shift        none (sign flag)     no
[illegible]   Left            shift-out flag       yes
[illegible]   Left            MSB                  yes
[illegible]   Left            present Carry        yes
[illegible]   Left            0                    yes
SL1           Left            1                    yes
SRLS          Right           LSB                  yes
SRCF          Right           Carry flag           yes
SRS           Right           Sign flag            yes
SROT          Right           shift-out flag       yes
SRSE          Right           MSB                  yes
SRCY          Right           present Carry        yes
SR0           Right           0                    yes
SR1           Right           1                    yes

Table 4.2. Shifter Operations.


An  8:1  mux  is  used  to  select  the  shifted  in  bit.  The  mux  output  goes  to  both  ends  of  the 
shifter,  but  there  are  fewer  choices  that  are  useful  for  shifting  in  from  the  left  (into  the  LSB)  than 
from  the  right.  For  example,  one  of  the  choices  available  is  the  most  significant  bit  or  MSB.  This 
bit  is  useful  for  left  circular  shifts  or  for  a  shift  right  with  a  sign  extension.  On  the  other  hand,  the 
least  significant  bit  (LSB)  is  useful  only  for  right  circular  shifts;  there  is  not  much  need  for  a  “left 
shift  with  LSB  extension.” 
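
The effect of the shifted-in bit on each direction can be seen in a small model. The sketch below is illustrative: the function names are hypothetical, and the caller supplies the shifted-in bit directly instead of going through a model of the 8:1 mux.

```python
WIDTH = 32
MASK = (1 << WIDTH) - 1


def shift_left(value, shift_in):
    """Shift left by one; `shift_in` enters the LSB. Returns (result, shift_out)."""
    shift_out = (value >> (WIDTH - 1)) & 1
    return ((value << 1) & MASK) | shift_in, shift_out


def shift_right(value, shift_in):
    """Shift right by one; `shift_in` enters the MSB. Returns (result, shift_out)."""
    shift_out = value & 1
    return (value >> 1) | (shift_in << (WIDTH - 1)), shift_out


def msb(value):
    return (value >> (WIDTH - 1)) & 1


def lsb(value):
    return value & 1
```

Feeding `msb(value)` into a left shift gives a left circular shift, feeding it into a right shift gives a sign extension, and feeding `lsb(value)` into a right shift gives a right circular shift.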

The choices of shifts are based on the ones in the Mano text. These turned out to be sufficient
for  the  microcode  written  so  far.  The  only  thing  the  EE588  class  needed  in  addition  to  what  was 
provided  originally  in  the  ALU/Shifter  was  the  ability  to  see  the  shifted  out  bit,  which  was  done 
by  sending  it  to  the  condition  mux  in  the  control  section. 

4.6.1.3 Flag Sources. Tables 4.1 and 4.2 show which functions set the various flags.
These  flags  are  latched  into  master/slave  flip-flops  which  drive  the  flag  to  the  rest  of  the  circuitry  on 
the  next  clock  cycle.  This  latch  allows  the  flags  to  be  held  until  they  are  reset,  but  it  introduces  a 
one  clock  cycle  delay  from  when  the  flags  are  set  to  when  they  can  be  checked.  The  only  exception 
is  the  carry  into  the  shifter,  which  has  an  input  for  the  carry  generated  by  the  present  function, 
and  an  input  for  the  carry  saved  in  the  flag  register. 

The one cycle delay in the flag bit is also required because there is not enough time in one
clock cycle for a flag bit from the ALU to be generated and ripple through the condition mux in
time to be valid for branching on in the next clock cycle. The effects of the one cycle delay on the
microcode were seen in the microcode example for the incrementable registers.

The  four  flags  from  the  ALU  are  the  ones  used  in  the  ASP.  The  first  flag  is  the  zero  flag, 
which is raised when the value out of the ALU/Shifter is zero. This condition is checked before the
value  is  driven  onto  the  C  bus,  so  the  flag  can  be  set  without  affecting  the  C  bus.  The  Carry 
flag  is  the  carry  bit  out  of  the  last  stage  of  the  carry-select  adder.  If  one  of  the  three  subtraction 
operations  is  being  done,  the  bit  coming  out  of  the  adder  is  inverted,  becoming  a  borrow  out.  The 
carry  out  goes  directly  to  the  shift-in  mux,  so  it  can  be  used  on  this  clock  cycle.  It  also  goes  to  a 
master-slave  flip-flop  to  be  the  carry  flag. 

The  sign  flag  is  just  the  MSB  of  the  number  out  of  the  ALU/Shifter.  The  overflow  flag  is 
the  XOR  of  the  carry  out  of  the  adder  and  the  carry  into  the  MSB’s  adder  cell.  This  is  the  most 
costly  flag  to  generate  in  terms  of  added  hardware.  To  get  the  true  carry  into  the  MSB  of  the  adder 
requires that a carry-select mux cell be put there, and at the end of the adder, to get the true
carry  out.  This  is  why  the  adder  ends  in  two  mux  cells  rather  than  just  one. 
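
The flag definitions above can be checked with a small behavioral sketch. It is not the FPASP circuit: the function name `add_with_flags`, the dictionary keys, and the default 32 bit width are assumptions made for illustration, and the one cycle latching of the flags is not modeled. The overflow flag is formed as the text describes, as the XOR of the carry out of the adder and the carry into the MSB.

```python
def add_with_flags(a, b, carry_in=0, width=32):
    """Add two unsigned `width`-bit words and derive the four ALU flags."""
    mask = (1 << width) - 1
    low_mask = mask >> 1                      # every bit below the MSB
    a &= mask
    b &= mask
    # Carry that enters the MSB adder cell.
    carry_into_msb = ((a & low_mask) + (b & low_mask) + carry_in) >> (width - 1)
    total = a + b + carry_in
    carry_out = total >> width                # carry out of the last stage
    result = total & mask
    flags = {
        "zero": int(result == 0),
        "carry": carry_out,
        "sign": result >> (width - 1),        # MSB of the result
        "overflow": carry_out ^ carry_into_msb,
    }
    return result, flags
```

With an 8 bit width, adding 0x7F and 1 gives 0x80 with the sign and overflow flags set, while adding 0xFF and 1 gives zero with the zero and carry flags set and no overflow.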

The  shifted  out  bit  flag  is  the  LSB  or  MSB  of  the  shifter,  depending  on  whether  a  left  or 
right  shift  is  being  performed.  No  new  control  signals  had  to  be  generated  to  load  this  flag;  the 
shift-right  and  shift-left  can  be  used  directly.  This  means  that  a  pass  or  grounding  the  C  bus  will 
not  affect  the  shift  out  flag. 

4.6.1.4 Simplification of Decoders. The design of a macrocell’s control decoder is
greatly  influenced  by  how  the  possible  functions  it  can  perform  are  arranged  with  respect  to  the 
microcode  control  bits.  By  aligning  the  necessary  control  inputs  into  groups  with  similar  control 
bit  states,  the  final  boolean  equation  for  those  control  signals  can  be  reduced.  This  arrangement  of 
functions  into  common  blocks  corresponds  to  grouping  them  into  contiguous  spaces  on  a  Karnaugh 
map. 

The  arrangement  of  choices  made  for  the  ALU  control  fields  to  simplify  the  decoding  can  be 
seen  in  table  4.1.  With  the  added  ability  to  choose  A  or  Abar  comes  the  ability  to  produce  the 
negative  logic  functions  NAND,  NOR  and  NOT,  which  were  not  available  in  the  ASP. 
This not only adds flexibility to the FPASP ALU, but also allows some choices for the way
these functions are carried out. For example, the NOT function can be done five different ways:

$\bar{A} = \bar{A}$             move
$\bar{A} = \bar{A} + 0$         OR
$\bar{A} = \bar{A} \cdot 1$     AND
$\bar{A} = \bar{A} \oplus 0$    exclusive OR
$\bar{A} = A \oplus 1$          exclusive OR


The  last  method  was  chosen,  because  after  all  the  other  functions  were  grouped  for  easy 
decoding,  the  NOT  function  was  left  in  the  group  of  control  functions  which  used  XOR  and  an 
input  of  A.  Table  4.3  also  shows  how  each  logic  function  in  the  ALU  is  used  at  least  twice.  This 
arrangement  simplified  the  decoding  of  every  control  bit.  The  only  control  bit  which  could  not  be 
fit  into  symmetric  groups  was  the  A  input  select. 
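
The equivalence of the five forms is easy to confirm in software. In the sketch below, which is only an illustration, the 32 bit width and the helper name `not_variants` are assumptions, and the constant 1 is taken to mean a word of all ones.

```python
MASK = 0xFFFFFFFF  # assumed 32 bit word; "1" in the list above means all ones


def not_variants(a):
    """Return NOT(a) computed the five ways listed above."""
    a &= MASK
    a_bar = ~a & MASK          # the complemented A input
    return [
        a_bar,                 # move of the complemented input
        a_bar | 0,             # OR with zero
        a_bar & MASK,          # AND with all ones
        a_bar ^ 0,             # XOR with zero
        a ^ MASK,              # XOR of the true input with all ones (the form chosen)
    ]
```

All five entries are equal for any input word, so the choice among them can be made purely on decoding convenience.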


Control Bits    ALU Operation    Carry In

0000            MOVN             don’t care
0001            OR               don’t care
0010            AND              don’t care
0011            XOR              don’t care
0100            MOV              don’t care
0101            NAND             don’t care
0110            NOR              don’t care
0111            NOT              don’t care
1000            INC              1
1001            SET              1
1010            ADC              Carry Flag
1011            ADD              0
1100            NEGA             1
1101            SUB              1
1110            SWB              Carry Flag
1111            DEC              0

[The “Function Inputs Selected” columns of this table are not legible in the source.]

Table  4.3.  ALU  Decoding  Patterns. 


Table 4.2 showed the mux selections for all of the operations. The ones which don't need a
shift-in bit still cause a mux selection, but it is not used. These selections are shown in parentheses.
The same mux selection is made for the upper and lower halves of the table. This eliminates any
decode circuitry for the mux selection; the decoding is done by the mux itself, using the three LSBs
of the control bits.

The decoding for the type of shift is done by the control MSB, which selects a left or right
shift. The only decoders needed are the ones to choose the first three operations, which replace
the left shifts with the less useful shift-in bits. So, although the shifter choice “PASS” is decoded
by the shifter input mux to select the sign flag bit, there is no effect on the output, because that
choice is also decoded to cause no shift of the bit stream out of the ALU.
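
The decoding scheme described above can be sketched as a behavioral model. This is an illustration only, under stated assumptions: the 4 bit control codes are assumed to follow the row order of Table 4.2 (0 for NOP through 15 for the last right shift), the word width is assumed to be 32 bits, and the function and parameter names are hypothetical. The three control LSBs address the shift-in mux directly, and the control MSB picks the direction.

```python
WIDTH = 32                       # assumed word width
MASK = (1 << WIDTH) - 1


def shifter(ctrl, value, shift_out_flag=0, carry_flag=0, sign_flag=0,
            present_carry=0):
    """Return (c_bus_value, shifted_out_bit); None means 'not driven / not saved'."""
    value &= MASK
    msb = (value >> (WIDTH - 1)) & 1
    lsb = value & 1
    select = ctrl & 0b111
    # 8:1 shift-in mux, addressed directly by the three control LSBs.
    shift_in = (lsb, carry_flag, sign_flag, shift_out_flag,
                msb, present_carry, 0, 1)[select]
    shift_right = bool(ctrl & 0b1000)        # control MSB selects the direction
    if not shift_right and select < 3:
        # The three codes that replace the less useful left shifts.
        if select == 0:
            return None, None                # NOP: C bus not driven
        if select == 1:
            return 0, None                   # GNDC: ground the C bus
        return value, None                   # PASS: mux output is ignored
    if shift_right:
        return (value >> 1) | (shift_in << (WIDTH - 1)), lsb
    return ((value << 1) & MASK) | shift_in, msb
```

For example, code 0b1100 (a right shift with the MSB shifted in) applied to 0x80000000 returns 0xC0000000, which is the sign-extending shift mentioned earlier.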

4.6.2  Barrel  Shifter.  The  barrel  shifter  was  one  ASP  macrocell  that  could  be  used  almost 
directly.  Capt.  Gallagher’s  design  of  the  components  made  it  simple  to  reverse  engineer  the  ASP 
barrel  shifter  and  use  the  exact  same  cells  in  a  larger  shifter  for  the  FPASP.  Even  the  PLA  decoder 
could easily be expanded to its full 32 bit width. The barrel shifter performs only left circular
shifts,  but  they  are  done  in  a  single  clock  cycle. 

Only  the  C  bus  drivers  were  enlarged  a  little  to  drive  the  longer  and  more  heavily  loaded  lines 
in the FPASP. A transistor sizing change was also made to the PLA decoder for the “no drive”
option. 

4.6.2.1 Control Sources. A way to insert control bits from the datapath was needed
for  one  of  the  microcode  projects.  This  was  provided  by  loading  the  5  LSB’s  of  the  lower  C  bus 
into  a  set  of  flip-flops.  A  mux  allows  the  control  bits  from  the  ROM  or  the  control  bits  stored  in 
the  register  to  be  used  by  the  shifter  control  PLA.  This  is  the  same  method  used  to  support  the 
assembly  language,  but  on  a  smaller  scale. 

4.6.3 Literal Inserter. The Literal Inserter was one of the first cells to be designed for
the FPASP. This was because it originally was to serve also as the A bus tie. This turned out to
be unnecessary, and later impractical, when the new floorplan put the control section on the other
side  of  the  chip  from  the  A  bus  tie. 

The Literal Inserter is basically the pulldown transistor from the register cells with a different
input.  It  takes  its  input  from  a  16  bit  field  in  the  microcode  word.  It  can  put  those  16  bits  in 
several  places.  They  can  go  on  the  upper  or  lower  16  bits  of  either  the  upper  or  lower  or  both  A 
busses. 

The 16 bits not inserted can also be controlled. The two choices for the uninserted bits are to
either ground them to ‘0’, or to leave them untouched at their precharged ‘1’ level. Both of these
choices are useful in different situations. The most common uses of a literal inserter are to put a
constant or a mask on the bus. For constants the MSBs will probably be zeroes, while for a mask
they will probably be left as ones.


The Literal Inserter has its own control field to drive the A busses. This means it can drive
the bus at the same time as a register. This usually means a problem in the code, but it can also be
used for masking out bits of a register word while it is still on the A bus. So the data can be masked
and then operated on in the same clock cycle. By driving two signals onto the precharged A bus,
an AND operation is performed, without any fighting between pullup and pulldown transistors.
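
The masking behavior can be illustrated with a small model of a precharged bus. This is a sketch under assumptions, not the FPASP cell: each A bus is taken to be 32 bits wide, and the names `literal_word` and `precharged_bus` are hypothetical. Every line starts at its precharged ‘1’ level and any driver can only pull it down, so the bus value is the AND of everything driven onto it.

```python
BUS_ONES = 0xFFFFFFFF            # assumed 32 bit A bus, precharged to all ones


def literal_word(literal, upper_half, ground_rest):
    """Word the literal inserter presents to one 32 bit A bus."""
    literal &= 0xFFFF
    if upper_half:
        word, rest = literal << 16, 0x0000FFFF
    else:
        word, rest = literal, 0xFFFF0000
    # The 16 uninserted bits are either grounded or left at their precharged level.
    return word if ground_rest else word | rest


def precharged_bus(*drivers):
    """Each driver can only pull lines down, so the result is a wired AND."""
    bus = BUS_ONES
    for value in drivers:
        bus &= value & BUS_ONES
    return bus
```

Driving a register holding 0x12345678 together with the literal 0x00FF on the lower half, with the upper half left at ones, leaves 0x12340078 on the bus: the mask is applied in the same cycle the register is read.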

The subcells which make up the Literal Inserter are identical. This cell is shown in Figure
4.14, which also shows the simple decoding circuitry which resulted from the arrangement of the
operation  choices.  The  design  of  the  cell  and  the  arrangement  of  the  control  choices  simplified  the 
design  of  the  control  decoders  down  to  a  few  gates. 

The cell itself is just a pulldown transistor controlled by a 2:1 mux and an enable line. One
of  the  two  mux  choices  is  a  bit  made  up  from  the  control  bits,  which  says  whether  to  ground  the 
bus  or  not.  The  other  is  the  bit  out  of  the  microcode  word  to  be  inserted  on  the  bus. 

4.6.3.1 Saving the Flags. The literal inserter is used by the routine which saves the
state of the machine when an internal inconsistency forces the machine to halt. This can either
be  a  call  from  the  user’s  code,  or  a  forced  branch  from  an  internally  generated  signal.  The  trap 
routine  uses  the  literal  inserter  to  write  out  the  flags. 

This was made possible by putting the literal inserter by the control section. The literal
inserter  provides  an  easy  path  to  get  bits  out  of  the  control  section  and  onto  the  datapaths,  where 
they  can  be  written  out  to  memory.  The  number  of  literal  bits  is  also  conveniently  the  number  of 
inputs  to  each  bank  of  the  condition  mux,  which  is  described  later. 

The  three  fields  of  the  literal  inserter  that  are  not  used  for  regular  operations  are  decoded  to 
mux  the  three  banks  of  flags  in  place  of  the  literal  field.  In  this  case,  the  flags  can  only  be  sent 
to the upper A bus, where they appear either on the upper or lower 16 bits. The 16 unused bits
are zeroed out or left alone according to the literal inserter controls. It makes no difference in this
case  what  the  other  bits  are,  just  that  the  flags  are  saved  off  the  machine.  Of  course,  any  function 
in  the  microcode  is  available  to  the  user,  so  the  flags  could  be  saved  at  any  time  if  required.  This 
would  be  useful  for  self-test  code. 

4.6.4  Function  ROM.  The  function  ROM  is  a  small  version  of  the  LPROM  used  for  the 
microcode.  This  was  the  simplest  way  to  make  the  choice  of  seeds  user-programmable.  The  two 
functions  which  will  be  written  into  the  XROM,  square  root  and  invert,  will  have  fixed  seeds.  These 
values can be “pre-zapped” by putting the diffusion in place before fabrication. This way the seeds
will exist without having to be put there by the laser.

The block diagram of the Function ROM is shown in Figure 4.15. Since the ROM only puts
out  five  bits,  the  PLA  cells  and  bitline  drivers  can  be  made  much  smaller  than  the  ones  needed  for 
the  microcode  store. 

The  Function  ROM  takes  in  three  control  bits  and  five  bits  from  the  datapath.  The  ROM 
puts  out  the  four  MSBs  of  the  seed’s  mantissa  and  the  LSB  of  the  seed’s  exponent.  The  three 
control  bits  choose  which  set  of  seeds  will  be  used,  and  the  datapath  bits  select  the  particular  seed. 


[Figure 4.15 block labels: control bits; address decode PLAs; LPROM, 32 x 5 bits x 7; sense amps;
7:1 muxes; 2:1 muxes; drivers to the upper C bus for the mantissa MSBs and the exponent LSB.]

Figure 4.15. Function ROM Block Diagram.

The number of tables can be increased for functions which do not require the seed’s exponent
bit.  This  can  be  done  by  having  the  routine  set  the  ‘exponent’  input  bit,  and  then  using  only  the 
four  mantissa  MSBs  to  address  into  the  table.  Similarly,  the  number  of  seeds  for  a  function  could 
be  increased  by  taking  the  five  MSBs  of  the  mantissa  and  shifting  them  left  one  bit. 

The  upper  B  bus  provides  the  input  bits,  and  the  result  is  driven  onto  the  upper  C  bus.  Only 
the  five  lines  corresponding  to  the  mantissa  MSBs  and  exponent  LSB  are  driven.  The  rest  of  the 
bits  must  be  set  by  another  source,  usually  the  literal  inserter. 

Microcode  written  for  the  Newton-Raphson  inversion  routine  is  listed  in  Appendix  B.  This 
code  shows  how  the  Function  ROM  is  used  to  generate  the  seed  MSBs,  and  how  the  literal  inserter 
is  used  to  create  the  final  floating  point  seed. 
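
The idea of seeding a Newton-Raphson inversion from a small table can be sketched in software. This is not the Appendix B microcode and not the contents of the Function ROM: the 16 entry table below is indexed by the four mantissa MSBs only, holds full precision midpoints rather than four bit seeds, and the names `SEEDS` and `reciprocal` are assumptions made for illustration. The iteration used is the standard one for a reciprocal, x' = x(2 - m x).

```python
import math

# Hypothetical seed table: entry i approximates 1/m at the midpoint of the
# i-th of 16 equal sub-intervals of [0.5, 1).
SEEDS = [1.0 / (0.5 + (i + 0.5) / 32.0) for i in range(16)]


def reciprocal(d, iterations=4):
    """Approximate 1/d with a table seed and Newton-Raphson refinement."""
    if d == 0:
        raise ZeroDivisionError("reciprocal of zero")
    m, e = math.frexp(d)                 # d = m * 2**e with 0.5 <= |m| < 1
    sign = -1.0 if m < 0 else 1.0
    m = abs(m)
    index = int((m - 0.5) * 32)          # four mantissa MSBs select the seed
    x = SEEDS[index]
    for _ in range(iterations):
        x = x * (2.0 - m * x)            # error roughly squares on each pass
    return sign * math.ldexp(x, -e)      # restore the exponent: 1/d = (1/m) * 2**-e
```

Because the seed is already accurate to about five bits, four passes are enough to reach double precision in this model; a coarser seed would need more passes.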

4.6.5 Floating Point Hardware.

4.6.5.1 Multiplier. The multiplier uses Booth’s octal encoding and a central core of
adders to do the mantissa multiplication [Fre88]. The A bus input is multiplied by three, then that
result and the original number are fed into the array. The B bus input is octally encoded according
to Booth’s algorithm. The A number times -4, -3, -2, -1, 0, 1, 2, 3 or 4 must be added to the partial
product depending on the encoding of the B number. Once the multiplying factor is known, one
of the two A inputs can either be added directly, shifted once and added, or the 2’s complement of
the number could be shifted and added.
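
The octal (radix 8) Booth recoding described above can be sketched as follows. This is a software illustration, not the adder array itself: the function names and the default 32 bit operand width are assumptions. Each digit is formed from three bits of B plus the top bit of the group below, which yields the multipliers -4 through 4, and the only multiple that cannot be made by shifting is 3A, which is formed once in advance.

```python
def booth_radix8_digits(b, width=32):
    """Recode b (|b| < 2**width) into radix 8 digits in the range -4..4."""
    digits = []
    previous_top = 0                       # bit just below the current group
    for i in range(width // 3 + 1):        # enough groups to cover a sign bit
        group = (b >> (3 * i)) & 0b111     # Python shifts sign-extend negatives
        b2, b1, b0 = (group >> 2) & 1, (group >> 1) & 1, group & 1
        digits.append(-4 * b2 + 2 * b1 + b0 + previous_top)
        previous_top = b2
    return digits


def booth_multiply(a, b, width=32):
    """Multiply by summing one selected multiple of a per radix 8 digit."""
    a3 = a + (a << 1)                      # 3A, formed once before the array
    multiples = {0: 0, 1: a, 2: a << 1, 3: a3, 4: a << 2}
    product = 0
    for i, digit in enumerate(booth_radix8_digits(b, width)):
        partial = multiples[abs(digit)]
        if digit < 0:
            partial = -partial             # two's complement of the multiple
        product += partial << (3 * i)
    return product
```

A 32 bit operand needs only 11 partial products in this scheme, against 32 for a bit-at-a-time multiply.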

As  the  partial  product  reaches  the  bottom  of  the  array,  a  110  bit  tree  carry-select  adder  is 
used  to  form  the  final  product.  The  final  product  is  rounded  off  and  combined  with  the  exponent, 
which  was  calculated  in  a  separate  circuit.  That  circuit  forms  the  exponent  by  adding  the  two 
input exponents, taking into account the bias of 1023. The equation of the final exponent is EXP
A + EXP B - 1023.
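
The exponent equation can be checked against IEEE double precision values in software. The sketch below is an illustration only; the helper names are assumptions, and the extra increment needed when the mantissa product is 2 or greater is deliberately left out.

```python
import struct

BIAS = 1023  # constant subtracted in the exponent equation above


def biased_exponent(x):
    """Return the 11 bit biased exponent field of a double."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return (bits >> 52) & 0x7FF


def product_exponent(exp_a, exp_b):
    """Biased exponent of a product, before any normalization shift."""
    return exp_a + exp_b - BIAS
```

For 3.0 and 5.0 the biased exponents are 1024 and 1025, and the equation gives 1026, which is the exponent field of 15.0.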

The core adder array can also be configured to multiply 32 bit integers, giving a full 64 bit
result.  The  particular  operation  to  be  performed  is  signalled  to  the  multiplier  when  the  data  is 
loaded  into  the  input  registers. 

Flags  required  by  the  IEEE  standard  are  also  generated  and  output  either  as  distinct  flags 
to  the  condition  mux,  or  encoded  into  the  result.  The  overflow  flag  is  also  used  to  indicate  if  an 
integer  multiply  result  has  MSBs  on  the  upper  datapath. 

4.6.5.2 Adder/Subtractor. The floating point adder/subtractor needs only a single
carry-select adder like the one in the ALU to form the mantissa of the result. The exponents require
more manipulation than in the multiplier, though. In the case of the adder, the exponents must
be made equal before the mantissas can be added together.


The  difference  between  the  exponents  is  used  to  tell  a  barrel  shifter  how  far  to  rotate  one  of 
the  mantissas  before  they  are  added  together.  Once  the  exponents  are  aligned,  the  exponent  of  the 
result  is  known.  If  the  difference  between  the  exponents  is  so  great  that  the  smaller  mantissa  is 
barrel-shifted completely away, a flag is raised.
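
The alignment step can be sketched as follows. This is a simplified model under assumptions: mantissas are plain unsigned integers, the alignment is modeled as a right shift of the smaller operand, rounding is ignored, and the name `align_mantissas` and the 53 bit default are hypothetical.

```python
def align_mantissas(mant_a, exp_a, mant_b, exp_b, mant_bits=53):
    """Align two mantissas to the larger exponent.

    Returns (larger_mantissa, aligned_smaller_mantissa, result_exponent,
    shifted_away), where shifted_away models the flag raised when the
    smaller operand is shifted completely out of the mantissa.
    """
    if exp_a >= exp_b:
        large, small, exponent = mant_a, mant_b, exp_a
    else:
        large, small, exponent = mant_b, mant_a, exp_b
    distance = abs(exp_a - exp_b)          # exponent difference drives the shifter
    shifted_away = distance >= mant_bits
    return large, small >> distance, exponent, shifted_away
```

With 4 bit mantissas, exponents 5 and 3 shift the smaller mantissa by two places, while exponents 10 and 3 shift it away entirely and raise the flag.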

4.6.5.3 Interface and Timing. The interface to the rest of the FPASP is identical
for  both  of  the  floating  point  processors.  The  input  data  is  loaded  into  a  master-slave  flip-flop,  and 
computation  begins  at  the  start  of  the  following  clock  cycle.  At  the  end  of  the  clock  cycle  following 
that,  the  circuitry  has  settled  and  can  be  enabled  to  drive  either  the  B  or  C  busses.  Control  of 
bus  access  is  by  the  same  decoders  used  for  the  registers.  The  multiplier  and  adder  are  treated  just 
like  registers  when  it  comes  to  output.  Figure  4.16  shows  the  interface  hardware  common  to  both 
floating  point  processors. 

The  main  difference  between  these  circuits  and  the  registers  is  that  the  B  bus  is  driven  instead 
of  merely  being  pulled  down  or  left  charged.  This  allows  the  full  clock  cycle  to  be  used,  otherwise 
the  circuit  would  have  to  settle  before  the  end  of  precharge  on  the  cycle  that  the  drive  control  for 
the  B  bus  is  issued.  If  the  B  bus  were  not  driven  it  might  be  discharged  accidentally  while  the 
circuitry  settled  in  the  time  after  the  fall  of  precharge. 


Figure 4.16. Floating Point Hardware Interface.

Figure 4.17 shows the operating cycle for both the multiplier and adder. Once they have been
loaded the first time, they can be loaded and have their previous results driven out at the same
time  every  two  clock  cycles.  If  the  input  is  loaded,  the  operation  will  be  started  on  the  next  clock 
cycle  and  the  results  will  begin  changing.  If  the  inputs  are  not  loaded,  the  results  can  be  held  as 
long  as  needed. 


Figure  4.17.  Floating  Point  Hardware  Timing. 

Each  of  the  floating  point  processors  has  a  set  of  flags  which  can  be  checked  for  either  the 
true  or  false  condition.  In  addition,  there  is  a  flag  which  is  a  combination  of  the  overflow  and  not-a- 
number  flags  for  a  single  check  of  any  abnormal  results.  This  is  the  flag  used  in  loops  where  there 
are  not  enough  lines  of  code  to  check  each  condition  separately.  The  flags  are  valid  on  the  clock 
cycle  after  the  results  are  driven  out.  The  flags  remain  unchanged  until  the  next  floating  point 
operation. All of the flags available are listed in the microcode definition in Appendix A.

4.7 Microsequencer


A  detailed  block  diagram  of  the  microsequencer  is  shown  in  Figure  4.18.  This  is  the  most 
important  and  complex  part  of  the  FPASP,  and  went  through  the  most  revisions.  The  basic 
sequencer  is  still  the  one  from  the  Mano  text,  but  it  no  longer  uses  any  of  the  cells  from  the  ASP 
except  the  registers  in  the  stack.  Major  changes  include  the  extension  of  the  stack  into  external 
memory,  and  the  mapping  ROM  needed  for  the  assembly  language. 

4.7.1 Branch Control. The flow of the microcode program is controlled by the choice of
addresses  into  the  XROM/LPROM  microcode  store.  This  address  comes  from  the  Control  Address 
Register  (CAR).  The  address  to  be  loaded  into  the  CAR  is  selected  from  four  choices,  based  on  the 
input  from  the  control  word,  and  control  signals  generated  within  the  FPASP.  Each  of  the  choices 
has  its  own  source  and  control  logic  for  selection. 

The  decision  of  which  address  to  load  is  made  by  the  branch  control  logic  which  drives  the 
4-to-1 mux at the input to the CAR register. Three of the choices are conditional: branch (BR),
return from a subroutine (ret), and call a subroutine (call).

BRANCH  and  CALL  take  their  next  address  from  the  microcode  word.  These  bits  are  the  10 
LSBs  of  the  field  used  for  the  literal  inserter,  so  that  device  cannot  be  used  on  the  same  line  as  one 
of these two conditional branches. The return takes its address from the stack, where it has just
been  ‘popped’  to  the  top. 

The address from either the microcode word or the stack is loaded only if the specified condition
is true. If the condition is not true, the default address for all three branches is the incremented
value  of  the  CAR.  This  is  the  address  of  the  next  sequential  instruction  in  the  microcode.  For  this 
reason,  two  of  the  choices  into  the  condition  mux  are  an  unconditional  true  and  false. 

If  a  branch  must  be  made,  then  the  unconditional  true  is  chosen.  If  the  code  is  to  be  done 
sequentially,  the  only  way  to  specify  the  next  address  is  to  use  one  of  the  branch  conditions  and 
choose  the  unconditionally  false  flag.  In  the  FPASP,  the  default  values  for  these  two  microcode 
fields  were  chosen  for  the  latter  case,  branch  and  unconditional  false.  The  program  will  proceed 
sequentially  if  no  value  is  put  into  these  two  fields  in  the  microcode. 

The  MAP  selection  is  not  conditional.  If  this  branch  is  selected,  it  will  occur  regardless  of  any 
flag  specified,  including  the  unconditional  true  or  false.  The  map  choice  is  used  for  two  purposes: 
to  support  the  assembly  language,  and  to  allow  the  FPASP  to  do  a  hardware  call  to  the  trap 
routine. 
Figure 4.18. Microsequencer Details.

The hardware call to the trap routine will occur if an inconsistency is found in the stack control signals. The conditions
for  this  to  occur  can  be  seen  in  the  control  logic  equations  in  Table  4.4.  The  conditions  checked  for 
are  whether  a  call  is  being  made  when  the  external  stack  is  full,  or  whether  a  return  is  being  made 
when  the  internal  stack  is  empty. 


Source of Next Address    Conditions for Selection

Incrementer               $\overline{COND} \cdot (BR + RET + CALL)$
Stack                     $COND \cdot RET$
ROM field                 $COND \cdot (BR + CALL)$
Mapping ROM               $MAP + STKTRAP$

STKTRAP (goto trap)       $COND \cdot (CALL \cdot MSF + RET \cdot STKE)$
                          (these are inconsistent conditions)

Table  4.4.  Next  Address  Sources  for  CAR. 
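
The selection rules of Table 4.4 can be written as a short behavioral sketch. It is an illustration under assumptions: the function and parameter names are hypothetical, `msf` and `stke` stand for the external-stack-full and internal-stack-empty signals, the trap is assumed to take priority over every other choice, and the incrementer is assumed to wrap at 10 bits.

```python
ADDR_MASK = 0x3FF      # 10 bit micro-addresses


def next_address(car, op, cond, rom_field, stack_top, map_rom,
                 opcode=0, msf=False, stke=False):
    """Choose the next CAR value; op is 'BR', 'RET', 'CALL' or 'MAP'."""
    # Inconsistent stack use forces a hardware call to the trap routine.
    stktrap = cond and ((op == "CALL" and msf) or (op == "RET" and stke))
    if stktrap:
        return map_rom[0]                    # MAP address zero is the trap routine
    if op == "MAP":
        return map_rom[opcode & 0x1F]        # unconditional, ignores cond
    if cond and op in ("BR", "CALL"):
        return rom_field & ADDR_MASK         # address field of the microword
    if cond and op == "RET":
        return stack_top & ADDR_MASK         # address popped from the stack
    return (car + 1) & ADDR_MASK             # default: next sequential address
```

A branch with a false condition simply falls through to the incremented CAR, which is why the default field values of branch and unconditional false give sequential execution.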

Other inconsistencies were originally checked, such as whether the internal stack was indicating
it was empty when the external stack was not. But there are many of these less likely conditions,
and  to  check  them  all  would  delay  the  control  bit  too  long.  So  the  only  conditions  checked  are  those 
which  will  arise  if  the  hardware  is  operating  correctly,  but  the  software  has  exceeded  its  capabilities. 

The  box  labelled  branching  logic  also  provides  the  control  signals  for  all  of  the  rest  of  the 
microsequencer.  These  include  the  signals  for  controlling  the  stack  and  stack  pointers.  The  one 
control  not  produced  by  this  logic  cell  is  the  GO  signal.  This  comes  from  the  host  through  a 
dedicated  pin.  It  resets  the  CAR,  the  pointer  to  the  external  stack,  and  the  tag  field  of  the  stack 
to  zero. 

4.7.2 Branching Latency. The latencies caused by the control section registers can best
be  seen  in  the  selection  of  the  branch  condition.  This  is  illustrated  in  Figure  4.19.  The  control  bits 
which  select  the  flag  are  in  the  instruction  loaded  two  clock  cycles  ago.  This  is  why  the  instruction 
after  a  true  branch  condition  is  still  executed:  it  was  in  the  middle  of  the  pipeline  between  the 
instruction  which  selected  the  branch  and  the  instruction  which  the  branch  selected. 
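
The effect of this latency on program order can be shown with a toy model. It is only a sketch: the program representation, the function name `run`, and the single delay slot are simplifying assumptions, and conditions are not modeled (every listed branch is taken).

```python
def run(program, max_steps=20):
    """Trace execution order with one delay slot after each taken branch.

    `program` is a list of (name, branch_target) pairs; branch_target is an
    index into the list, or None for an instruction that does not branch.
    """
    trace = []
    pc = 0
    pending = None                 # target that takes effect one instruction late
    while pc < len(program) and len(trace) < max_steps:
        name, target = program[pc]
        trace.append(name)
        next_pc = pc + 1
        if pending is not None:    # the earlier branch finally redirects the CAR
            next_pc, pending = pending, None
        if target is not None:     # this branch affects the instruction after next
            pending = target
        pc = next_pc
    return trace
```

Running a five instruction program whose second instruction branches to the fifth executes the third instruction as well, but skips the fourth.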

The  microcode  ROMs  are  fast  enough  to  produce  the  condition  mux  selects  and  branch  selects 
in  time  to  choose  the  next  address  for  the  CAR  without  going  through  the  pipeline  registers.  This 
would  remove  the  problem  of  having  the  instruction  after  the  branch  executed.  The  reason  this  was 
not  pursued  is  that  the  ROMs  put  out  erroneous  control  bits  during  the  precharge  pulse.  These 
bits could cause the stack to be popped or pushed, and the state of the stack would then be incorrect.
Figure 4.19. Microsequencer Latency.

4.7.3 Condition Mux. The condition mux collects all of the flags generated throughout
the FPASP. The mux is then used to select which flag will be the branch condition. If the selected
flag is a ‘1’, then the condition is true. The select lines come directly from the microword on the
control  bus. 

The condition mux is made up of three 16-to-1 muxes as shown in Figure 4.20. These allow
a total of 48 distinct flags to be selected from. The 6 bit “Conditional Multiplexer Select” field of
the microword allows 64 choices. The 16 extra selections are the inverses of the flags fed into the
first of the 16-to-1 muxes.


[Figure 4.20 input labels: 0/32 ... 15/47, 16 ... 31, 48 ... 63]


Figure  4.20.  Condition  Multiplexer. 

There are actually more than 16 flags whose inverse can be used as a branch condition. For
these extra ones, it was easiest to use two inputs of the other 16-to-1 muxes. Doing it this way
allowed a single design for all three muxes, and it simplified the decoding of the control bits. With
the 16-to-1 muxes and the 2-to-1 polarity reversing mux arranged as shown, no decoders are needed
at all. The control bits from the microword are decoded by the muxes directly.

4.7.3.1 Flag Sources. Some of the flags do not come from the datapath. Two of the
flags come directly from pins. These are the IO1 and IO2 flags. They are provided for testing and can
also be used for interrupt flags.

Other flags which do not come from the datapath are the unconditional true and false. These
are formed by grounding the first input of the 16-to-1 mux that has the polarity reversal mux after
it. The default condition mux control word is 000000, which is the false condition.
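
The way the 6 bit select field is decoded by the muxes themselves can be sketched as below. The assignment of select ranges to banks is an assumption read from the input labels of Figure 4.20 (0 to 15 and 32 to 47 on the first mux, 16 to 31 and 48 to 63 on the other two), and the function and parameter names are hypothetical.

```python
def condition_mux(select, bank0, bank1, bank2):
    """Return the branch condition chosen by a 6 bit select value.

    Each bank is a sequence of 16 flag bits. bank0 feeds the mux that is
    followed by the polarity reversing mux, and bank0[0] is grounded.
    """
    index = select & 0xF               # four LSBs address every 16-to-1 mux
    group = (select >> 4) & 0b11       # two MSBs pick the mux and the polarity
    if group == 0:
        return bank0[index]            # selects 0..15: first bank, true polarity
    if group == 1:
        return bank1[index]            # selects 16..31: second bank
    if group == 2:
        return bank0[index] ^ 1        # selects 32..47: first bank, inverted
    return bank2[index]                # selects 48..63: third bank
```

With the first input of the first bank grounded, select 0 gives the unconditional false and select 32 gives the unconditional true.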

Another  pair  of  flags  are  the  ‘even’  flags.  These  flags  come  from  the  datapaths,  but  not  from 
the  processing  hardware.  They  are  merely  the  inverted  LSBs  of  the  C  busses.  This  flag  was  added 
at  the  request  of  one  of  the  EE588  groups  so  they  could  easily  check  whether  the  integer  on  the 
bus was even or odd. If the number is even, this flag will be ‘true.’

These LSBs are passed through master-slave flip-flops to ensure they will be valid in time
to affect the branch logic. This also means they are not available until one clock cycle after they
are generated, just like all the other flags. The reason for this is that the worst case delay through
the ALU/Shifter, combined with all the delays in the branch logic, is longer than the clock cycle,
which was made long enough only for the worst case delay of the ALU/Shifter alone.

The  even  flags  must  be  checked  on  the  cycle  after  being  generated  or  they  will  be  lost.  This  is 
because they are not held in the flip-flops, only delayed by them. The flip-flops latch in the LSBs
of the C busses on every clock cycle because no control bits are available to control them. So the
flag only remains valid until the C bus is driven by another bit.

The IO1, IO2 and the unconditionally true and false flags are the only ones which do not pass
through flip-flops before reaching the condition mux. For the IO flags, this means they cannot be
saved by the trap routine unless they are valid when the first group of flags are put out by the
literal inserter.

4.7.4 Mapping ROM. The mapping ROM (MAP) uses the opcode of the assembly language
instruction in the lower R1 to select the micro-ROM address of the routine which supports
that instruction. The MAP is an LPROM similar to the one used for the Function ROM. This
allows the user to map the unused opcodes to the addresses of microcode routines for new assembly
instructions.

There  are  32  possible  micro-addresses  in  the  MAP,  corresponding  to  the  32  combinations  of 
the  5  opcode  bits.  Ten  of  these  are  used  for  the  fixed  assembly  language  instructions.  The  zero 
address  is  used  to  point  to  the  trap  routine,  so  the  machine  can  make  a  hardware  call  to  save 
the  state  of  the  machine.  That  address  is  also  accessible  from  the  microcode  by  clearing  the  lower 
R1’s 5 LSBs and using the map branch instruction.

To do the hardware call the microsequencer branch logic closes the gates from the lower R1
and grounds the control inputs. To keep the LPROM from being addressed on every clock cycle
and thus dissipating power, the map signal is used as one of the inputs to the PLA cells.

An  LPROM  was  chosen  for  the  MAP  since  it  is  the  only  laser  programmable  cell  available  at 
this  time.  The  MAP  is  organized  as  32  by  10  bits.  The  fixed  addresses  in  the  MAP  are  fabricated 
with  the  diffusion  already  in  place,  just  like  the  fixed  seeds  in  the  Function  ROM. 

4.7.5 Stack. The stack is an array of master-slave flip-flops 11 columns wide by 16 rows
deep. Each flip-flop in a column can load the output of the flip-flop above or below it. This forms
a last-in, first-out (LIFO) stack. Extra hardware has been added to extend the stack into the
external  memory  when  it  has  filled  up.  The  arrangement  of  flip-flops  is  shown  in  Figure  4.21. 


Figure  4.21.  Stack  Register  Connections. 

The  stack  holds  the  return  addresses  of  subroutines.  When  a  call  instruction  is  issued  and 
the condition is true, the address of the subroutine is loaded into the CAR, and the address that
was  in  the  CAR  previously  is  incremented  and  ‘pushed’  onto  the  stack. 

When  the  subroutine  is  done,  it  issues  a  ret  instruction  and  chooses  a  true  condition.  This 
causes  the  CAR  mux  to  pick  the  address  off  the  top  of  the  stack.  At  the  same  time,  the  stack 
flip-flops  load  the  output  of  the  flip-flop  below  them,  ‘popping’  each  of  the  stored  addresses  up  one 
row.

4.7.5.1  Extension to Memory. Extending the stack into the external memory required additional circuitry to keep track of the addresses of the micro-addresses put into the external memory, and to tell when to read or write that data. Figure 4.22 shows the hardware associated with the stack extension. The addresses are produced by an up/down counter called the stack pointer. The signals controlling the read and write come from the branch logic and from the tag column of stack flip-flops.

The  tag  column  is  used  to  keep  track  of  how  deep  the  stack  has  been  pushed.  When  the  GO 
pin  is  raised  by  the  host,  this  column  of  flip-flops  is  reset  to  zero.  The  input  to  the  top  of  the  tag 
column  is  always  a  1. 

When an address is pushed onto the stack, the 1 tag is pushed on with it. As the stack fills, the 1’s eventually reach the last flip-flop in the column. This becomes a signal to the control logic that the next push will cause the stack to overflow. At the bottom of the tag column, a 1 or 0 is popped in, depending on whether or not the external stack is empty. A zero is loaded if the stack pointer is zero.

If  the  tag  says  the  stack  is  full  and  another  push  occurs,  the  control  logic  activates  a  set  of 
muxes  to  extend  the  stack  into  the  external  memory.  The  address  pushed  out  of  the  stack  is  muxed 
onto  the  10  LSBs  of  the  lower  data  pads,  the  address  in  the  stack  pointer  is  muxed  onto  the  lower 
address  pads,  and  the  lower  write  enable  pin  is  dropped.  This  writes  the  stack  address  into  the 
lower memory at address zero. The stack pointer is then incremented to point to the next address.

If  there  are  stack  addresses  in  the  external  memory  when  the  stack  is  popped,  the  decremented 
pointer  address  is  muxed  onto  the  address  pads,  but  in  this  case  the  write  enable  pin  is  left  at  its 
default  high  level,  so  a  read  is  performed.  The  data  is  muxed  from  the  pads  directly  into  the  bottom 
of  the  stack. 

This  takeover  of  the  address  and  data  pads  by  the  stack  means  the  microcode  cannot  be 
trying  to  do  a  read  or  write  on  the  same  line  as  a  call  or  return.  This  does  not  apply  to  routines 
which  will  not  cause  the  stack  to  overflow,  but  the  user  must  make  sure  those  routines  are  not 
called  by  others  that  might  cause  an  overflow. 
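
The push and pop behavior of the stack and its extension into memory can be summarized with a small behavioral model. This is a sketch under stated assumptions, not a description of the actual circuit: the class name `MicroStack`, the use of a dictionary for the lower external memory, and the exceptions raised are all illustrative choices.

```python
class MicroStack:
    DEPTH = 16        # rows of flip-flops in the internal stack
    EXT_WORDS = 1023  # external memory words available to the stack

    def __init__(self):
        self.rows = [0] * self.DEPTH  # rows[0] is the top of the stack
        self.tags = [0] * self.DEPTH  # tag column, cleared by GO
        self.sp = 0                   # stack pointer (external address)
        self.ext = {}                 # stands in for the lower memory

    def push(self, addr):
        if self.tags[-1]:             # bottom tag set: internal stack full
            if self.sp >= self.EXT_WORDS:
                raise OverflowError("external stack full (trap)")
            self.ext[self.sp] = self.rows[-1]  # spill the bottom row
            self.sp += 1
        self.rows = [addr & 0x3FF] + self.rows[:-1]
        self.tags = [1] + self.tags[:-1]       # a 1 always enters the top

    def pop(self):
        if not self.tags[0]:
            raise IndexError("stack empty")
        top = self.rows[0]
        if self.sp > 0:               # refill the bottom from memory
            self.sp -= 1
            refill, tag = self.ext[self.sp], 1
        else:                         # external stack empty: pop in a 0
            refill, tag = 0, 0
        self.rows = self.rows[1:] + [refill]
        self.tags = self.tags[1:] + [tag]
        return top
```

In this model the bottom tag stays at 1 for as long as any address remains in the external memory, so the internal stack keeps reading full until the spilled addresses have all been read back.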

4.7.5.2  Stack Pointer. The stack pointer consists of two registers: one with an
incrementer,  and  the  other  with  a  decrementer.  It  also  has  two  logic  gates  for  determining  whether 
it  is  full  or  empty.  The  hardware  is  shown  in  Figure  4.23. 

The two registers work in a master-slave fashion. The master is the register with the incrementer. The address which is sent to the pads is determined by whether a pop or push is being done. To keep track of the proper addresses, the master register always loads whichever address was sent to the pads. The slave register always loads and decrements whichever address is being put out by the master register.

Figure 4.22.  Stack Extension to Memory.

MSE  = Memory Stack not Empty (set if any register bit = 1)
MSF  = Memory Stack Full (set if all register bits = 1)
SPT+ = Increment stack pointer (push microaddress into memory)
SPT- = Decrement stack pointer (pop microaddress from memory)

Figure 4.23.  Stack Pointer Circuitry.

The  slave  is  a  register,  so  there  is  a  one  clock  cycle  delay  from  when  it  loads  in  the  decremented 
value  and  when  it  can  drive  it  out  again.  This  is  not  a  problem  because  two  returns  cannot  occur 
one  after  the  other  due  to  the  latency  in  the  control  pipeline. 

Two  gates  determine  if  the  external  stack  is  full  (MSF)  or  empty  (MSE).  The  MSF  flag  is 
used  to  generate  a  trap  condition  if  a  push  is  attempted  after  the  external  stack  is  full.  The  MSE 
flag  is  used  to  feed  the  tag  column.  If  the  external  stack  is  not  empty,  a  1  is  popped  into  the  tag. 
The  bit  into  which  the  1  is  popped  is  itself  a  flag;  stack  full  (STKF).  Thus  the  internal  stack  reads 
full as long as the external stack is not empty. When the external stack is empty, 0’s are popped into the tag column. When these 0’s reach the top of the stack they indicate that the entire stack is empty.

With a register 10 bits wide, a total of 1023 external memory words can be used for stack values. The stack values go into the 10 LSBs of each memory word, and it does not matter what goes into the other bits. For the addresses, the 10 LSBs are set by the pointer, and the 10 MSBs are set to zero. Only 1023 words can be addressed instead of 1024 because on the last push [...] With this type of pointer, the external stack can be made any depth, as long as the incrementer and decrementer can settle in one clock cycle. The stack pointer is reset to zero by the go signal from the host.
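
One reading of the 1023 word limit that is consistent with the flag definitions in Figure 4.23 can be checked with a short sketch. The assumption here, which the text does not state outright, is that each push writes at the current pointer value and then increments it, and that no further push is accepted once MSF is set. The function names are illustrative.

```python
PTR_BITS = 10
PTR_MASK = (1 << PTR_BITS) - 1   # 0x3FF, all register bits set


def mse(ptr):
    """Memory Stack not Empty: set if any register bit is 1."""
    return ptr != 0


def msf(ptr):
    """Memory Stack Full: set if all register bits are 1."""
    return ptr == PTR_MASK


def usable_words():
    """Count pushes accepted before MSF blocks further pushes."""
    ptr, words = 0, 0
    while not msf(ptr):
        words += 1                   # write one word at address ptr
        ptr = (ptr + 1) & PTR_MASK   # then increment the pointer
    return words
```

Under these assumptions the pointer value of all ones is never used as a write address, which leaves 1023 of the 1024 addressable words for stack values.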


4.7.5.3  Reserved Memory Locations. The stack and the trap routine require
some  reserved  space  in  the  external  memory.  The  easiest  part  of  memory  for  the  hardware  to 
address  is  the  lowest  part,  since  it  is  simple  to  clear  a  counter  to  zero  and  start  counting  up.  The 
first  word  of  the  upper  memory  is  used  to  tell  the  FPASP  if  it  has  been  programmed  when  the 
machine  starts  up.  It  is  overwritten  if  the  TRAP  routine  is  called. 

The  stack  writes  out  to  the  lowest  1023  addresses  of  the  lower  external  memory.  The  TRAP 
routine  writes  the  state  of  the  machine  into  the  lowest  addresses  of  the  upper  memory. 

4.7.6  Microcode Store. The microcode is stored in two places on the FPASP. Fixed code
is  fabricated  in  the  XROM,  and  user-defined  code  is  written  into  the  LPROMs.  These  two  parts 
of  microcode  memory  share  a  contiguous  address  space,  so  the  microsequencer  does  not  care  where 
the  next  instruction  comes  from.  The  memory  is  organized  as  784  words  deep  by  128  bits  wide. 
There are actually two halves of the microcode ROM, each one 64 bits wide.

There are 640 words of XROM and 144 words of LPROM. The address lines to the two of them decode which control word to read and which of the two ROMs to read it from. The two types of ROM are completely separate electrically; a mux is used to select which one will output the control word. Figure 4.24 shows the arrangement and interconnections between the ROMs.

Figure 4.24.  Microcode ROM Arrangement.


4.7.6.1  XROM. The XROM has been described in other papers, so it will not be
covered  in  great  detail  here  [Ros85]  [Lin88-2].  It  is  a  precharged  device  which  uses  the  presence  or 
absence  of  a  transistor  to  represent  data.  The  arrangement  of  four  transistors  around  a  common 
drain  gives  the  cell  its  name. 


Figure 4.25 will be used to show how the XROM works. The addresses are decoded in NAND gates along each side of the array. While the decoding is being done, the bitlines are precharged high. When precharge ends, the wordline associated with the decoded address goes high. This wordline goes to all the gates of the transistors for that word and the word next to it. These pairs of transistors have their source tied to one of two lines: one line is driven by the address LSB and the other by its inverse. The line which is low decides which transistor of the pair is the actual data.


Figure 4.25.  XROM Cell [Ros85].

For each ‘1’ in the data there is a transistor on the corresponding bitline which is turned on when the wordline goes high. This allows the precharge on the bitline to bleed off to the low LSB line through the transistor. The resulting low voltage is detected by the sense amplifier and is output as a ‘1.’

This cell was designed so that it could be laid out by a silicon compiling program. This program lays out the entire ROM, including the sense amps and PLA decoders. Another feature of the program is that it can take the table of ROM data and optimize the layout to reduce the drain capacitance on the bitlines. This program has been modified to allow it to choose which columns of bits it will re-order in the optimization process.

This last feature is a necessity on the FPASP because the control lines have been optimized for minimum length and minimum routing channel area. This means that only rearranging of rows can take place within the XROM.

Normally,  the  XROM  optimizer  rearranges  the  columns  (and  rows)  and  prints  out  a  mapping 
of  where  the  columns  ended  up.  This  rearrangement  of  the  columns  cannot  be  done  in  the  FPASP 
because leaving an open channel of busses 64 lines wide on each side of the ROMs for de-scrambling
the  columns  would  require  too  much  area.  Also,  the  columns  have  been  selected  for  each  half  of  the 
ROM  and  ordered  within  each  half  so  that  they  are  as  close  as  possible  to  their  destinations.  This 
reduces  the  channel  area  required  and  also  reduces  the  number  of  crossovers  between  the  lines. 

The  only  optimizing  will  be  rearranging  the  rows  and  inverting  the  polarity  of  bitlines  which 
are  more  than  half  populated  with  transistors.  This  allows  the  transistors  to  represent  whichever 
value there is fewer of on that bitline. If there are more ‘1’s than ‘0’s, the transistors are used to represent the ‘0’s instead, and the output of that sense amplifier is inverted.
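
The bitline polarity step can be illustrated with a short Python sketch. It is a simplified model that treats the ROM as a list of integers and counts one transistor per stored ‘1’; the function names and the data layout are assumptions made for the example and do not describe the actual XROM compiler.

```python
def optimize_bitlines(rom_words, width):
    """Invert every bitline that is more than half populated with 1s.

    Returns (stored_words, invert_mask, transistor_count), where a set
    bit in invert_mask marks a bitline whose sense amplifier output
    must be inverted on readout.
    """
    depth = len(rom_words)
    invert_mask = 0
    for bit in range(width):
        ones = sum((word >> bit) & 1 for word in rom_words)
        if ones > depth / 2:
            invert_mask |= 1 << bit
    stored = [word ^ invert_mask for word in rom_words]
    transistors = sum(bin(word).count("1") for word in stored)
    return stored, invert_mask, transistors


def read_word(stored, invert_mask, addr):
    """Model the inverted sense amplifiers restoring the original data."""
    return stored[addr] ^ invert_mask
```

For the four 4-bit words `0b1110, 0b1101, 0b1011, 0b0001`, bitlines 0 and 3 each hold three ‘1’s out of four, so both are inverted and the transistor count drops from 10 to 6 while every word still reads back unchanged.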

4.7.6.2  Laser PROM. The Laser PROM works on the same principles as the XROM, and uses many of the same cells [Til88]. The LPROM represents data the same way, but now the presence or absence of the transistor is determined by a laser. Figure 4.26 shows one of the LPROM transistors. The transistor is fabricated with a gap in the diffusion. If the transistor is to be put into the circuit, a laser beam is shined on the gap. Intense local heating causes the diffusion on either side to spread into the gap. When they join, the transistor becomes electrically connected to the bitline.

One  of  the  reasons  there  is  less  LPROM  storage  is  that  it  is  eight  times  less  dense  than  the 
XROM.  This  is  due  to  the  size  of  the  transistors  needed  to  represent  the  data.  The  transistor  size 
is  limited  by  the  accuracy  of  the  laser  programming  equipment  available. 

The  XROM  and  LPROM  supply  the  same  microword  bits  to  the  same  control  lines,  and  there 
is  no  re-arranging  of  outputs  between  them.  The  LPROM  must  therefore  be  written  so  its  column 
assignments  are  the  same  as  the  XROM,  since  the  XROM  is  fixed. 
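
Taken together, the two ROM types behave as one 784 word store. A minimal sketch of the address decode is given below. It assumes that the XROM occupies the low addresses and the LPROM the addresses directly above it; the text states only that the address space is contiguous, so this ordering, like the function names, is an assumption.

```python
XROM_WORDS = 640
LPROM_WORDS = 144
TOTAL_WORDS = XROM_WORDS + LPROM_WORDS   # 784 words of microcode
HALF_BITS = 64                           # each ROM half is 64 bits wide


def select_rom(addr):
    """Return which ROM supplies the control word and its local index."""
    if not 0 <= addr < TOTAL_WORDS:
        raise ValueError("micro-address outside the microcode store")
    if addr < XROM_WORDS:
        return ("XROM", addr)
    return ("LPROM", addr - XROM_WORDS)


def split_halves(word):
    """Split a 128-bit control word into its two 64-bit halves."""
    lower = word & ((1 << HALF_BITS) - 1)
    upper = word >> HALF_BITS
    return upper, lower
```

Because both ROMs drive the same control lines in the same column order, the result of `select_rom` matters only to the output mux and not to anything downstream of it.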

Figure 4.26.  LPROM Cell [Til88].

4.7.7  Pipeline Registers. The pipeline is an array of master-slave flip-flops with muxes at their input which allow them to load from two or three different sources. The bits are sent off to their respective decoders on the next clock cycle. The outputs of the flip-flops are sized for the control lines they must drive. Long control lines which go to many gates must have more current drive than ones which go only a short distance and end in a few gates.

To  cover  the  various  current  needs,  four  different  sized  stage-up  outputs  were  designed.  There 
are  also  two  possible  input  circuits,  either  a  two  or  three  input  mux.  A  section  of  the  pipeline 
showing  the  two  types  of  input  is  shown  in  Figure  4.27.  This  figure  also  shows  the  configuration 
for the scanpath design-for-testing.

Figure 4.27.  Pipeline Registers.

The default input for every register is the microword bit out of the ROMs. The other input which they all can choose is the output of the flip-flop next to them. This is for forming the scanpath for testing. The optional input is for taking in a control bit from one of the R1 registers which hold the assembly language control bits.

The  pipeline  registers  give  the  decoders  more  time  to  compute  the  control  signals  by  presenting 
the  control  bits  at  the  start  of  the  clock  cycle.  If  the  control  bits  were  to  come  directly  from  the 
ROMs,  they  would  not  be  valid  until  after  the  precharge  and  access  times.  This  would  not  do  for 
the bus select decoders, which must have valid control signals before the end of precharge.

4.7.7.1  Assembly Language Input. The muxes used by the assembly language are grouped into six sets, each set enabled by one of the control choices in the ‘Macrocode Support Mux Selects’ field listed in Appendix A. Each routine which carries out an assembly instruction uses one or more of these choices to override the control bits from the ROMs with control bits from the R1s. Table 4.5 shows the bits overridden by each choice. The new control bits are fed in before the pipeline registers so they act just like regular control bits.


Selection    Overridden Control Fields

RSEL         Bus Selects, Bus Ties
BRSEL        Condition MUX Select
SRSEL        E Bus Selects, E Bus Tie
ALSEL        ALU/Shifter Control, Bus Selects, Bus Ties
SHSEL        Barrel Shift Amount, Bus Selects, Bus Ties
ISEL         Pointer and Incrementable Register Control

Table 4.5.  Assembly Language Override Selections
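
The override mechanism can be sketched as a lookup followed by a field-by-field substitution. The selection names follow Table 4.5; the lower-case field keys, the dictionary representation of a microword, and the function name `pipeline_input` are assumptions made for the example.

```python
# Control fields overridden by each macrocode support mux selection.
OVERRIDES = {
    "RSEL":  ("bus_selects", "bus_ties"),
    "BRSEL": ("condition_mux_select",),
    "SRSEL": ("e_bus_selects", "e_bus_tie"),
    "ALSEL": ("alu_shifter_control", "bus_selects", "bus_ties"),
    "SHSEL": ("barrel_shift_amount", "bus_selects", "bus_ties"),
    "ISEL":  ("pointer_and_incrementable_register_control",),
}


def pipeline_input(rom_fields, r1_fields, selections):
    """Return the fields loaded into the pipeline registers.

    Every field defaults to the ROM value; each active selection
    replaces its fields with the values held in the R1 registers.
    """
    word = dict(rom_fields)          # copy, so the ROM word is untouched
    for selection in selections:
        for field in OVERRIDES[selection]:
            word[field] = r1_fields[field]
    return word
```

Since the substitution happens before the pipeline registers, the decoders downstream see an ordinary control word regardless of where each field came from.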

4.7.7.2  Scanpath Hardware. The scanpath hardware is shown in Figure 4.27. The
scanpath  design  for  testing  scheme  requires  special  pins  for  controlling  the  pipeline  muxes  and 
providing  input  and  outputs.  Separate  pins  were  used  since  they  were  available  and  the  data  and 
address  pins  all  had  multiple  inputs  already. 

The first pin is the testmode pin, which chains together all of the pipeline muxes to form the scanpath. The other two pins are the testin and testout pins for passing data to and from the scanpath. Also, the clock lines to the pipeline registers are separate from all the other clock lines in the FPASP. In normal operation these clock lines are connected with the rest of the clock lines externally on the circuit board.

To operate the pipeline, the test clocks are separated from the rest of the clocks, and the TESTMODE pin is raised. The scanning can now be done while the rest of the machine holds its state. The control bits can be observed or replaced through the testin and testout pins.
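
The scanning operation can be modelled as a simple shift register. The sketch below assumes that testin feeds the first flip-flop of the chain and testout is taken from the last; the direction of the chain and the function name are illustrative assumptions.

```python
def scan_shift(chain, testin_bits):
    """Clock the scanpath once per input bit with TESTMODE raised.

    chain[0] is the flip-flop fed by testin and chain[-1] drives
    testout. Returns (new_chain, testout_bits).
    """
    chain = list(chain)
    testout_bits = []
    for bit in testin_bits:
        testout_bits.append(chain[-1])   # last flip-flop appears on testout
        chain = [bit] + chain[:-1]       # every bit moves one place along
    return chain, testout_bits
```

Shifting a chain of length n by n clocks reads out every control bit. In this model, feeding the observed bits back in the same order restores the original contents, so the bits can be observed without being lost, or replaced by shifting in different values.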

4.8  Interface  Hardware 

The  FPASP  has  several  pins  dedicated  to  external  interfaces.  The  pad  arrangement  of  the 
FPASP  is  shown  in  Figure  4.28.  The  pads  have  been  arranged  so  that  they  are  as  close  as  possible 
to  the  part  of  the  circuit  they  are  connected  to.  For  the  control  pins  this  is  easy  to  do,  but  the 
data  and  address  pins  are  too  numerous  and  require  channel  space  at  the  periphery  of  the  circuit 
to  run  the  busses  to  them. 

The  GO  pin  allows  the  host  to  reset  the  FPASP  to  its  starting  state.  The  go  signal  resets 
the CAR, stack pointer, and stack tag registers. The FPASP then starts up at address zero of the
microcode.  This  is  a  small  routine  which  loads  the  lowest  word  of  the  upper  memory  and  checks 
the  LSB.  If  it  is  a  1,  the  code  branches  to  the  first  word  in  the  LPROM,  where  the  user's  routine 
starts.  If  the  LSB  is  zero,  it  continues  on  into  the  built-in  self-test  routine  which  does  an  internal 
check on all the hardware. It writes the results out to the external memory.

The  GO  signal  must  be  valid  for  at  least  two  clock  cycles  to  assure  that  the  microsequencer 
has  been  reset.  Figure  4.29  shows  the  timing  between  host  and  FPASP,  and  the  states  of  their 
external  bus  drivers. 

The  DONE  pin  is  controlled  by  a  bit  from  the  microcode  word.  It  signals  that  the  FPASP 
has completed its work and is ready to start again. When the FPASP routine finishes, it branches
to  a  small  endless  loop  which  raises  the  done  line  and  waits.  This  bit  also  puts  all  of  the  FPASP 
output  pad  drivers  into  a  high  impedance  state  so  the  host  can  control  the  external  memories.  This 
can  also  be  done  at  any  time  by  raising  the  highz  pin. 

The  two  external  flag  pins  101  and  103  are  direct  inputs  to  the  condition  mux.  There  are 
two  pins  dedicated  for  the  precharge  circuitry.  These  would  normally  be  connected  to  clock 
lines.  They  take  the  load  off  of  the  lines  feeding  the  logic  circuitry,  but  they  can  also  be  used  to 
source precharge separately from the clocks if needed.

Figure 4.29.  Host-FPASP Handshake Timing.


V.  Macrocell Design and Verification


5.1  Introduction

This  chapter  presents  some  of  the  details  that  went  into  the  design  and  verification  of  the 
FPASP  macrocells.  This  falls  mostly  at  the  CMOS  level  of  Figure  5.1.  This  figure  is  embellished 
slightly  from  the  previous  chapters  to  show  where  the  various  CAD  tools  are  applied  in  the  design 
hierarchy. 


Figure  5.1  CAD  Tools  Used. 

The first part of the chapter will examine the FPASP circuitry from the layout point of view. The next part of the chapter will discuss some of the modeling that was done to verify the design at both the device and macrocell levels. The last part of the chapter presents an outline model of the FPASP written in the VHSIC Hardware Description Language, and shows how that model will be used in the design cycle for FPASP applications.

5.2  Design  Style 

5.2.1  Cell  Libraries.  The  hardware  design  began  with  an  examination  of  the  cell  libraries 
available at AFIT, especially the one created for the ASP chip by Capt. Gallagher. This survey
showed  that  some  of  the  hardware  could  be  used  without  change,  but  most  of  it  would  need 
modification.  The  reasons  for  redesign  were  either  because  the  ASP  cells  were  sized  for  a  smaller 
machine,  or  the  FPASP  needed  additional  functionality  from  the  cells. 

The  ASP  library  provided  most  of  the  cells  for  the  barrel  shifter.  That  macrocell  went  together 
very  easily  and  quickly  thanks  to  the  modular  design  created  by  Capt.  Gallagher.  This  was  true  of 
most  of  the  ASP  cells.  Using  these  cells  fixed  the  bus  pitch  in  the  FPASP  to  81  lambda,  but  this 
did not present a problem. Most of the cells which had to be built for the datapath sections were
based  on  parts  of  ASP  cells  which  already  had  this  pitch. 

Another cell library plundered for parts was the Winograd Fourier Transform (WFT) cell
library.  The  pad  designs  were  the  closest  to  what  the  FPASP  needed,  and  they  had  been  updated 
for  1.2  micron  CMOS.  Most  of  the  other  WFT  cells  were  for  functions  not  needed  in  the  FPASP. 

5.2.2  Standard  Cells.  All  of  the  registers  are  based  on  the  same  master-slave  flip-flop 
circuit.  The  only  differences  are  in  the  number  and  type  of  output  drivers,  and  the  number  of 
input  multiplexer  channels.  The  various  common  register  circuits  are  shown  in  Figure  5.2.  This 
same  basic  flip-flop  is  also  used  without  the  bus  structures  for  the  flag  flip-flops.  All  of  the  registers 
conform to the 81 lambda pitch of the data paths.

The  gates  used  to  build  the  decoders  throughout  the  FPASP  are  built  from  a  standard  cell 
library. These cells are derived from the bus select field decoders. The standard cells only take up
a  little  more  space  than  cells  custom-designed  for  each  application  and  they  are  easier  and  faster 
to  lay  out. 

The decoder cells are kept very basic and are made to stack together. The cells contain only
transistors,  no  well  contacts  or  routing  lines.  After  the  decode  logic  is  written,  the  types  of  gates 
needed  and  their  arrangement  are  decided.  Then  they  are  laid  out  cell  by  cell,  leaving  room  for 
interconnections. After the cells are laid out the interconnects are made.

When an entire decoder is laid out, the hierarchy of cells is flattened and the wells and well contacts are put in. Since the particular fabrication process for the FPASP is unknown, both the N and P wells are drawn in and all have well contacts to help prevent latch-up.

Other standard cells were used for staging up current drive. These consisted of sets of ‘doughnut’ inverters. These cells were created before the design of the FPASP was completed, so once the cells were designed the design calculations for current drive were based on these cells.

The  most  commonly  used  standard  cell  library  at  AFIT  is  the  XROM  library.  This  set  of 
cells can be laid out completely by software [Lin88-2]. It would be difficult to reliably lay out a large ROM by hand. Using the XROM compiler, the process is done correctly in minutes.

5.2.3  Transistor Sizing. The sizes of the various control line drivers depend on how fast that control has to become valid. For example, on controls gated by precharge, all of the decoding circuitry up to the final NAND gate has 10 ns to settle. But the circuitry following that gate must be fast because now it is part of the critical timing path.

For  the  least  gate  delay,  the  stage-up  of  inverter  gate  widths  should  be  around  2.7  [Gla85]. 
This  rule  was  followed  as  much  as  was  practical  in  the  FPASP  design.  For  signals  that  did  not 
have  to  become  valid  immediately  after  the  control  bits,  the  stage-up  was  stretched  to  as  much  as 
14.  In  cases  where  the  timing  was  critical,  SPICE  models  were  simulated  to  size  the  transistors. 
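
The preference for a stage-up near 2.7 can be illustrated with the usual first-order model of a tapered inverter chain. The sketch assumes that the delay of each stage is proportional to its stage-up ratio, that the number of stages may be treated as a real number, and that self-loading is ignored. These simplifications belong to the example and are not taken from the FPASP design calculations.

```python
import math


def relative_delay(stage_ratio, load_ratio):
    """Delay of a tapered inverter chain, in unit-inverter delays.

    load_ratio is the final load capacitance divided by the input
    capacitance of the first inverter. The chain needs
    ln(load_ratio) / ln(stage_ratio) stages, each costing stage_ratio.
    """
    stages = math.log(load_ratio) / math.log(stage_ratio)
    return stages * stage_ratio


def best_ratio(load_ratio, candidates):
    """Pick the candidate stage-up ratio with the least total delay."""
    return min(candidates, key=lambda f: relative_delay(f, load_ratio))
```

In this model the delay is proportional to f / ln(f), which is smallest at f = e, or about 2.7. Stretching the stage-up to 14 roughly doubles the delay while using far fewer stages, which is a reasonable trade for signals that are not on the critical path.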

In the case of the bus ties, they have all of precharge to become valid, so even though the sum of gates in a tie is large, the driver can be small because it has 10 ns to charge or discharge those gates. So in this case, a stage-up of 12 was designed.

The  ratio  of  P  to  N  transistor  gate  widths  used  in  the  FPASP  ranges  between  1.7  and  2.  This 
is  to  compensate  for  the  higher  channel  resistance  of  the  P  devices.  By  balancing  the  resistance  of 
the  channels,  the  amount  of  current  drive  for  either  a  high  or  low  output  can  be  kept  the  same. 
This  makes  the  rise  and  fall  times  nearly  equal,  so  the  time  the  two  transistors  spend  fighting  each 
other  at  the  switchover  point  is  decreased. 

Most of the muxes in the ALU/Shifter consist of only N type pass transistors. Since these
devices  cannot  pass  a  full  5  volt  ‘1’,  the  transistor  gate  width  ratio  in  the  inverter  following  them 
must  be  different  from  regular  inverters  to  equalize  the  rise  and  fall  times.  For  these  inverters,  the 
N  transistor  is  made  only  slightly  smaller  than  the  P  transistor.  This  gives  it  the  same  channel 
resistance  as  the  P  transistor  since  it  never  gets  fully  turned  on. 

Transistor  sizing  is  also  important  for  combinational  gates.  For  these  circuits,  the  ratios  of 
the channel resistances were made equal by an analogy to resistors. Putting two transistors in series is the same as putting two resistors in series: the resistance is doubled. Making a transistor’s gate wider is like putting resistors in parallel: the resistance decreases. The design goal was to make the total worst case N and P transistor channel resistances the same, within a reasonable layout area.
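
The resistor analogy can be turned into a small sizing calculation for a multiple input NAND gate. The sketch assumes that channel resistance is inversely proportional to gate width and that a P device has twice the resistance of an N device of the same width. The factor of two is an assumption chosen to match the upper end of the 1.7 to 2 width ratio quoted above, and the function names are illustrative.

```python
def channel_resistance(width, unit_resistance):
    """Resistance of a channel, assumed inversely proportional to width."""
    return unit_resistance / width


def nand_widths(n_inputs, unit_width=1.0, p_to_n_resistance=2.0):
    """Gate widths that equalize worst-case pull-up and pull-down.

    Pull-down: n_inputs N transistors in series, so each is widened by
    n_inputs to match a single unit-width N device.
    Pull-up (worst case): only one of the parallel P transistors is on,
    so it is widened by the P-to-N resistance ratio.
    """
    n_width = unit_width * n_inputs
    p_width = unit_width * p_to_n_resistance
    return n_width, p_width
```

For a 5 input NAND gate this gives N devices five units wide and P devices two units wide, with both worst-case paths equal to the resistance of one unit-width N transistor.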

The NOR type zero flags in the FPASP presented a different design criterion. These are
pseudo-NMOS  designs,  where  a  P  transistor  is  used  to  pull  the  output  node  high.  When  any  one 
of the parallel N transistors turns on, it tries to ground the node. In order to make sure the N transistor will pull the node low enough to be seen as a ‘0’, its channel resistance should be at least
i  times  less  than  the  P  transistor’s.  For  the  FPASP,  this  ratio  was  6  to  1.  This  is  because  the  node 
is  usually  in  the  pulled  down  state,  so  the  power  dissipation  is  less  with  the  smaller  P  transistor. 
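
The effect of the 6 to 1 ratio can be estimated by treating the two conducting transistors as a resistive divider. This is a first-order sketch that assumes linear channel resistances and a 5 volt supply; real devices are nonlinear, so the numbers are indicative only, and the function names are illustrative.

```python
def output_low_voltage(vdd, r_p_over_r_n):
    """Output voltage with the P pull-up fighting one N pull-down.

    Divider: V_OL = vdd * R_n / (R_n + R_p) = vdd / (1 + R_p / R_n).
    """
    return vdd / (1.0 + r_p_over_r_n)


def static_current(vdd, r_p, r_n):
    """Current drawn from the supply while the node is held low."""
    return vdd / (r_p + r_n)
```

With a ratio of 6 the low output level comes to 5/7 of a volt, or about 0.71 V. A weaker P transistor (larger R_p) both lowers this level and reduces the static current, which matters because the node spends most of its time pulled down.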

Charge  sharing  is  another  problem  which  is  remedied  by  proper  transistor  sizing.  The  loadable 
registers have two T-gates at their inputs. If the inner T-gate closes and the input changes before
the  other  gate  closes,  charge  gets  trapped  between  them.  This  charge  may  be  enough  to  change 
the  state  of  the  register  when  only  the  inner  gate  opens  again.  To  prevent  this  problem,  SPICE 
simulations  were  run  to  size  the  transistors  inside  the  inner  T-gate  so  their  state  would  not  change 
if  the  stored  charge  was  applied  to  their  gates. 

5.2.4  Cell Design. Many cells were designed for the FPASP. From all of this a design
style  emerged  which  made  cell  layout  faster.  One  idea  was  planning  out  the  placement  of  all  the 
contacts  before  the  placement  of  the  transistors.  The  area  of  the  cells  is  usually  limited  by  how 
many  contacts  can  be  squeezed  into  the  spaces  around  the  transistors. 

Time  spent  planning  the  contact  arrangement  made  up  for  time  lost  pushing  circuit  elements 
around  to  make  room  for  that  one  last  contact  that  never  seems  to  fit  anywhere.  In  the  cell  plans, 
symmetry  is  the  most  useful  feature  to  start  with.  Cells  that  have  symmetrical  features  are  easy 
to  lay  out  and  easy  to  modify.  They  also  stack  well  in  the  macrocells,  and  make  it  easier  to  find 
misconnected  nodes. 

Another  simple  design  rule  was  used  at  the  cell  level  for  deciding  which  input  of  a  multiple 
input  gate  a  signal  should  go  to.  These  gates  were  usually  multiple  input  NAND  gates.  These 
gates  have  a  long  chain  of  N  transistors.  The  charge  between  the  transistors  must  all  be  drained 
to  ground  to  change  the  state  of  the  output  node.  The  signal  which  was  predicted  to  arrive  last 
was  put  on  the  gate  closest  to  the  output  node.  When  that  signal  arrives,  the  other  transistors  are 
already  on,  so  the  only  charge  not  already  drained  off  is  the  charge  of  the  output  node.  An  example 
of this is the bus select decoders, where the precharge-bar signal is always the last to arrive.

5.2.5  Macrocell Design. The bus select decoders were designed as blank cells. All five of the input control lines run through each cell, as well as precharge. The complements of the inputs are made in the cells where they are used, so the inverted control bits do not take up wiring channel space. This means that inverters are needed in every cell that requires a complemented input, but the space for the inverters is already there since the cell must be as wide as the register cells anyway.

All  of  the  cells  have  the  same  5  input  NAND  gate.  They  are  personalized  by  overlay  cells 
which contain poly lines and vias for connecting the input signal or its complement to the NAND gates. There are 30 small overlay cells for decoding the input control choices 00001 to 11111. This
scheme  reduces  the  memory  required  for  the  cell  library  since  only  the  personalizations  for  each 
version  are  stored,  rather  than  entire  decoders.  The  plain  decoder  cells  were  also  used  as  the  basis 
for  the  standard  cells  used  in  the  other  decoder  circuits. 

Since  the  bus  select  signals  are  common  to  most  of  the  registers,  the  decoding  circuitry  is 
designed  to  be  the  same  width  as  the  general  purpose  register  cells,  and  is  located  directly  next  to 
the  register  which  it  controls.  This  spreads  the  circuitry  out  some,  but  no  additional  routing  space 
is  needed  to  run  the  control  signals  to  the  registers. 

The  latter  case  arose  with  the  barrel  shifter  control  decoder.  That  decoder  was  considered  for 
use  with  the  register  array  since  it  also  decodes  five  control  bits;  but  three  of  them  would  be  needed, 
and  although  they  would  take  up  less  area  by  themselves,  the  extra  area  needed  to  distribute  their 
outputs  to  the  registers  and  the  time  needed  to  lay  out  those  lines  made  them  a  poor  choice. 

The separate-decoder-for-each-register scheme was also used in the ASP, although in that case
they  were  used  to  support  the  ASP  methodology  of  making  the  register  array  as  deep  as  needed 
for  each  application. 

The  horizontal  metal  1  and  vertical  metal  2  convention  common  to  most  AFIT  designs  was 
followed  in  the  FPASP  whenever  possible.  This  convention  makes  the  most  efficient  use  of  the 
routing channels by reducing the chance that a line will have to cross another of the same type.
This convention starts to fail when a large number of contacts are needed. This was especially true
in  the  ALU  cells,  where  routing  also  had  to  be  run  in  polysilicon  through  the  bus  channel. 

Long  poly  runs  were  only  used  where  the  driver  was  large  and  the  load  was  small.  This 
minimizes  the  RC  delay  associated  with  charging  up  a  capacitance  through  a  resistor.  In  this  case, 
the  small  load  makes  the  capacitance  small,  and  a  3  lambda  width  helps  lower  the  resistance  of  the 
poly  line. 
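
The effect of line width and load on the delay of a poly run can be illustrated with a lumped RC estimate. The sketch below is only an illustration: the sheet resistance, line length and load capacitance are assumed values rather than figures from the FPASP design, the function names are hypothetical, and the capacitance of the poly line itself is ignored.

```python
def poly_resistance(length_lambda, width_lambda, sheet_ohms_per_sq):
    """Resistance of a poly line: sheet resistance times the number of squares."""
    return sheet_ohms_per_sq * (length_lambda / width_lambda)


def rc_delay(resistance_ohms, load_farads):
    """Lumped RC time constant in seconds."""
    return resistance_ohms * load_farads


# Assumed values, chosen only for illustration (not taken from the thesis).
SHEET = 25.0    # ohms per square
LENGTH = 300.0  # line length in lambda
LOAD = 50e-15   # load capacitance in farads

narrow = rc_delay(poly_resistance(LENGTH, 2.0, SHEET), LOAD)
wide = rc_delay(poly_resistance(LENGTH, 3.0, SHEET), LOAD)
print(f"2 lambda wide: {narrow * 1e12:.1f} ps")
print(f"3 lambda wide: {wide * 1e12:.1f} ps")
```

Under these assumptions the wider line has fewer squares and therefore a smaller time constant, and a smaller load shortens the delay in the same proportion.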

5.3 SPICE Modeling


The SPICE 3 program was used to simulate sections of the circuit at the transistor level
[Qua86]. This program provides analog plots of the node voltages throughout the circuit. With
SPICE 3, these plots could be viewed on a color monitor for rapid feedback about a design choice.
Several alternative versions of the transistor being tested were kept commented out on adjacent
lines in the input file, so changes could be made rapidly. When a good design was reached, a hard
copy of the final plot was made.

There  is  no  good  tool  in  the  AFIT  CAD  environment  for  converting  the  Magic  layout  designs 
into  SPICE  input  files.  A  method  for  getting  exact  models  from  the  Magic  layout  was  devised.  In 
Magic,  the  transistors  in  question  were  zoomed  in  on  so  they  filled  the  screen,  then  only  the  gates, 
diffusion  and  diffusion  contacts  were  displayed.  On  this  display,  the  various  node  numbers  from  the 
model  were  written  using  an  overhead  marker.  The  various  sources  and  drains  were  also  marked. 
Then  the  perimeters  and  areas  of  the  transistors  could  easily  be  transferred  from  the  screen  to  the 
proper  transistor  listing  in  the  model.  As  each  transistor  was  entered  into  the  model,  its  display 
was  X’d  off  with  the  marker.  The  marker  was  also  used  to  draw  the  boundary  between  transistors 
with  shared  drains. 
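
The hand-transfer step amounts to filling in one SPICE MOSFET element line per transistor from the dimensions read off the screen. A minimal sketch of that bookkeeping follows, assuming the standard SPICE element form (nodes, model name, then L, W, AD, AS, PD, PS). The helper name, the model name NFET, the node numbers and the dimensions are all hypothetical.

```python
def mosfet_card(name, drain, gate, source, bulk, model,
                l_um, w_um, ad_um2, as_um2, pd_um, ps_um):
    """Format one SPICE MOSFET element line.

    Lengths are given in microns (suffix U) and areas in square
    microns (suffix P, since 1e-12 square meters is one square micron).
    """
    return (f"M{name} {drain} {gate} {source} {bulk} {model} "
            f"L={l_um:g}U W={w_um:g}U "
            f"AD={ad_um2:g}P AS={as_um2:g}P "
            f"PD={pd_um:g}U PS={ps_um:g}U")


# Hypothetical transistor: node numbers and dimensions are made up.
card = mosfet_card("1", 3, 2, 0, 0, "NFET",
                   l_um=1.2, w_um=6.0,
                   ad_um2=18.0, as_um2=18.0,
                   pd_um=18.0, ps_um=18.0)
print(card)
```

For transistors with shared drains, the area and perimeter on either side of the boundary drawn with the marker would be entered on the respective transistor's line.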

5.3.1  Register  Models.  The  FPASP  registers  are  similar  to  the  ASP  registers,  but  have 
been  redesigned  for  1.2  micron  CMOS  based  on  SPICE  simulations.  The  pulldown  transistors  have 
also  been  sized  for  the  larger  busses  in  the  FPASP. 

The register model used for this simulation included models for the entire A and B busses,
the  drains  of  the  other  transistors  on  the  busses,  and  the  lengths  of  the  connections  within  each 
register  on  the  bus.  Values  for  capacitances  given  in  the  Weste  text  were  used,  along  with  the 
oxide  thicknesses  shown  in  Figure  5.3.  A  factor  of  two  was  added  for  fringing  effects  that  are  not 
accounted  for  in  the  flat  plate  capacitor  model  used  [Wes85]. 
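
The flat plate estimate with the factor of two for fringing can be sketched as follows. This is an illustration under stated assumptions: the permittivity is computed from the usual textbook relative permittivity of SiO2 (3.9), and the bus length and width are made-up values, not FPASP dimensions.

```python
EPS0_F_PER_M = 8.854e-12  # permittivity of free space
K_SIO2 = 3.9              # relative permittivity of SiO2 (textbook value)


def bus_capacitance(length_um, width_um, oxide_um, fringe_factor=2.0):
    """Flat plate capacitance in farads, scaled by a fringing factor."""
    area_m2 = (length_um * 1e-6) * (width_um * 1e-6)
    thickness_m = oxide_um * 1e-6
    return fringe_factor * EPS0_F_PER_M * K_SIO2 * area_m2 / thickness_m


# Hypothetical bus line: 1000 microns long, 3 microns wide.
c_metal1 = bus_capacitance(1000.0, 3.0, oxide_um=1.0)  # 1 micron of oxide
c_metal2 = bus_capacitance(1000.0, 3.0, oxide_um=2.0)  # 2 microns of oxide
print(f"metal 1: {c_metal1 * 1e15:.1f} fF, metal 2: {c_metal2 * 1e15:.1f} fF")
```

Doubling the oxide thickness halves the estimate, which is why the same line costs less capacitance in metal 2 than in metal 1.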

It  turned  out  that  the  pulldowns  designed  for  the  ASP  were  large  enough  for  the  FPASP, 
but  the  flip-flop  itself  had  some  designed-in  capacitance  that  was  unnecessary  for  the  1.2  micron 
process. This was capacitance added to the feedforward inverter of the φ2 latch to overcome charge
sharing with the node outside of the φ2 T-gate. The SPICE results show that the new design has
adequate  capacitance  without  making  the  inverter’s  gates  longer.  This  decreases  the  delay  through 
the  inverter,  allowing  it  to  latch  in  a  new  value  faster,  and  also  decreases  the  time  that  power  is 
dissipated  by  the  inverter. 


[Figure: cross-section of the bus model. Metal 2 lies over 1 micron of SiO2, over Metal 1, over 1 micron of SiO2, over the Si substrate.]

C = capacitance of the bus (Farads)
A = bus length x bus width (microns^2)
ε = 3.45 x 10^-16 F/m (SiO2)
t1 = 1 micron (metal 1)
t2 = 2 microns (metal 2)

C = (εA / t)(2)

Figure 5.3. Bus Model used for SPICE.

The SPICE results also show that an average current of 1mA is needed to keep 32 register
cells in the quiescent state of neither loading nor driving. This number is based on a rough estimate
derived from the plot of the voltage across a 0.5 ohm resistor put between the circuit and ground.

Also modeled was the current needed to pull down a tied A bus line. The peak current during
pulldown was 0.8mA. In the worst case, all of the A bus lines would be pulled down together; for
example, when a pair of registers is being initialized to zero. The peak current for the register
pulling all the lines down is 25.6mA. The very worst case for the ground line is when all of the
busses (A,B,C,D,E) are pulled down from a 1, which would require a peak current of 102mA.

At most, only two general purpose register cells at a time can be sinking 0.8mA each on a
given ground line per cycle. The cells each contain a 2.4 micron metal1 ground line running through
them. Based on a current carrying capacity for metal1 of 1mA per micron of width to prevent metal
migration [Wes85], these power supply lines are sufficient for the peak current needs of the register
array.
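
The ground line check above is simple arithmetic, sketched here with the figures quoted in the text (0.8mA per cell, two cells per ground line, a 2.4 micron line, 1mA per micron of width). The function names are illustrative only.

```python
def line_capacity_ma(width_um, limit_ma_per_um=1.0):
    """Current a metal line can carry within the metal migration limit."""
    return width_um * limit_ma_per_um


def peak_demand_ma(cells_sinking, ma_per_cell):
    """Peak current drawn by cells pulling down at the same time."""
    return cells_sinking * ma_per_cell


capacity = line_capacity_ma(2.4)    # 2.4 micron metal1 ground line
demand = peak_demand_ma(2, 0.8)     # two register cells on one ground line
bus_peak = peak_demand_ma(32, 0.8)  # one register pulling down all 32 A bus lines

print(f"ground line: {demand:.1f} mA needed, {capacity:.1f} mA allowed")
print(f"A bus pulldown peak: {bus_peak:.1f} mA")
```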

The ASP decoder design was expanded for the FPASP, making use of the SPICE simulation
program to verify the operation and speed of the circuits. The model of a decoder was added to
the register model used above. This gives the decoder a realistic load. The model uses the worst
case delay, where the input to the NAND gate that is farthest from the output node is the last one
that changes, which leaves the maximum amount of charge to be drained from the output node
to ground. The 10nS precharge time was long enough to cover the worst case delay through the
NAND gate, which was only 3.44nS. The delay from the fall of precharge to the rise of the bus
select line was 1.8nS. This represents the delay of the precharge-bar NAND gate and the inverters
used to stage up the current drive.

5.3.2 ALU Models. The worst case delay in the FPASP is the path through the carry
select adder. A SPICE model of the longest path through the carry select adder gave a 12nsec
delay time. This delay, when added to the delay through the ALU input circuitry and the estimated
delay through the shifter stage, gives a total delay of 24 nsec. This includes driving the data and
results through the bus ties.

This worst case delay is the main reason for making the width of the precharge pulse as short
as possible. A 10 nsec precharge pulse is long enough for the worst case of charging the XROMs,
and allows over 25 nsec for the ALU results to become stable on the C bus. The 10 nsec precharge
also leaves enough time for the control bits to be decoded and become stable at the inputs to the
final precharge-bar NAND gates.
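
The timing figures quoted in this section can be collected into a small budget check. The sketch below uses only numbers from the text; the evaluation window is taken as 25 nsec, which the text gives as a lower bound ("over 25 nsec"), so the computed ALU margin is a minimum. The names are illustrative.

```python
PRECHARGE_NS = 10.0   # width of the precharge pulse
NAND_WORST_NS = 3.44  # worst case delay through the decoder NAND gate
ALU_PATH_NS = 24.0    # ALU input circuitry + carry select adder + shifter
ALU_WINDOW_NS = 25.0  # time left for ALU results (lower bound from the text)


def margin_ns(available_ns, required_ns):
    """Slack left after the required delay; negative means a violation."""
    return available_ns - required_ns


decoder_margin = margin_ns(PRECHARGE_NS, NAND_WORST_NS)
alu_margin = margin_ns(ALU_WINDOW_NS, ALU_PATH_NS)
print(f"decoder margin: {decoder_margin:.2f} ns")
print(f"ALU margin: at least {alu_margin:.2f} ns")
```

The decoder has several nanoseconds to spare inside the precharge pulse, while the ALU path uses nearly all of its window, which is why the precharge pulse is kept as short as possible.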

5.4 Design Verification

The  best  looking  layout  is  worthless  until  it  has  been  verified  that  it  will  work  correctly,  at 
least  at  the  logic  level.  The  tools  available  at  AFIT  for  verification  include  the  Esim  switch  level 
simulator  and  several  static  design  checkers. 

The  first  level  of  check  comes  during  the  layout  process.  The  Magic  layout  tool  does  continuous 
design  rule  checking  as  the  layout  proceeds.  After  the  layout  is  done,  the  design  is  converted  into 
a CIF (Caltech Intermediate Format) file by the Magic tool.

The next tool is Mextra, which takes the CIF file and extracts the netlist of transistor connections,
plus values for the capacitances of the nodes. The resulting SIM file can be simulated by
the  Esim  program,  but  first  some  static  checks  are  done.  The  Mextra  program  generates  several 
other  files  which  can  point  out  wiring  or  other  errors. 

The most useful of these files is the alias file, which lists all of the labels which are electrically
connected. The Cstat program recognizes certain labels as being inputs or outputs of the circuit.
These labels must be the ones with the fewest characters in them, otherwise Mextra will alias them
to another label on the same line, and they will not be visible to Cstat. The solution to this problem
is to make sure no other labels are on these lines, or make sure they are longer than the labels put
there for Cstat or Esim.
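
The labeling rule can be checked before extraction with a few lines of code. The sketch below models only the behavior described above (the shortest label on a node is the one that survives) and is not a description of how Mextra is implemented. The alphabetical tie-break, the function names and the example labels are assumptions.

```python
def canonical_label(labels):
    """Label kept for a node under the fewest-characters rule.

    Ties are broken alphabetically here, which is an assumption.
    """
    return min(labels, key=lambda name: (len(name), name))


def hidden_io_labels(node_labels, io_labels):
    """Input/output labels that would be aliased away on their node."""
    hidden = []
    for labels in node_labels:
        kept = canonical_label(labels)
        hidden.extend(name for name in labels
                      if name in io_labels and name != kept)
    return hidden


# Hypothetical nodes: each inner list holds the labels on one electrical node.
nodes = [["Cout", "c0"], ["Ain", "alu_a_input"]]
print(hidden_io_labels(nodes, io_labels={"Cout", "Ain"}))
```

In this example the output label Cout shares a node with the shorter label c0, so it would be hidden from the static checker until c0 is removed or lengthened.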

Labels  become  a  mixed  blessing  with  these  tools.  On  one  hand,  labels  make  it  easy  to  identify 
a  node,  but  on  the  other  hand  lots  of  labels  are  sure  to  cause  problems  of  the  sort  mentioned  above. 
In  some  cases,  the  labels  are  in  a  cell  that  can’t  be  changed,  because  it  is  used  in  other  designs.  In 
any  case,  the  aliases  must  all  be  explained  or  corrected  before  a  simulation  can  be  done. 

The static circuit checker Cstat takes the netlist in the SIM file and looks at how all the
transistors are connected. It prints a list of all the transistor configurations, and a list of all the
transistors which are isolated from the inputs or outputs. Transistors are listed if they cannot be
set to 1 or 0, if they cannot be affected by an input, or if they cannot affect the output. Cstat also
gives the Magic coordinates for the last gate closest to the node.

This tool was run on a version of the ALU/shifter macrocell. This version contained all of the
control and flag circuitry, but the array of ALU cells was shortened down to the four cells which
contained unique circuitry. The duplicate cells which were removed were only slowing down the
verification process. If the ALU/Shifter works with the cells in the smaller version, then it should
work with the full 32 cell array.

The Mextra and Cstat files were used to locate 5 wiring errors. Another error was found
while  looking  for  one  of  the  five  others.  This  was  a  power-tied-to-ground  error  that  was  not  found 
by  the  static  checkers.  In  this  case,  one  of  the  ALU  cells  had  a  P-well  contact  tied  to  Vdd.  When 
a  line  leading  to  one  of  the  errors  was  selected  in  Magic,  all  of  the  power  lines  lit  up  as  well. 

After all the Cstat listings are fixed or explained, the next step is to make the circuit Esim-able
by removing circuitry that Esim cannot properly simulate. One of these circuits is the sense
amplifier  in  the  XROMs.  The  Fixrom  program  will  replace  these  sense  amps  with  inverters.  The 
Nofeed  program  removes  feedback  clocked  inverters,  and  reports  on  how  many  of  the  clocked 
inverters  were  removed.  That  number  should  match  with  the  number  of  flip-flops  laid  out. 

Only  after  all  the  above  steps  have  been  performed  can  Esim  be  used  to  check  the  function 
of  the  circuit.  Esim  takes  a  set  of  input  vectors  and  runs  them  through  the  circuit  model.  The 
results  are  then  compared  with  the  expected  results.  If  they  do  not  match,  the  problem  must  be 
deduced  and  fixed.  Then  the  entire  process  must  be  repeated. 
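
The comparison of simulated and expected results is easy to automate once both are in a common form. The sketch below assumes each simulation step has already been reduced to a mapping from node name to logic value; it does not parse Esim's own output format, and the node names are hypothetical.

```python
def compare_vectors(simulated, expected):
    """Return (step, node, expected, got) for every mismatched node."""
    mismatches = []
    for step, (got, want) in enumerate(zip(simulated, expected)):
        for node, value in want.items():
            if got.get(node) != value:
                mismatches.append((step, node, value, got.get(node)))
    return mismatches


# Hypothetical results for two steps of a carry/zero flag check.
simulated = [{"C0": 1, "Z": 0}, {"C0": 0, "Z": 1}]
expected = [{"C0": 1, "Z": 0}, {"C0": 1, "Z": 1}]

for step, node, want, got in compare_vectors(simulated, expected):
    print(f"step {step}: node {node} expected {want}, got {got}")
```

An empty result means the circuit model matched the expected vectors. Any entry points to the step and node where the search for the problem should start.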

The  decoders  were  Esim’d  individually  and  then  the  smaller  version  of  the  ALU  was  simulated. 
Several  errors  were  discovered  in  the  decoders:  two  labels  were  reversed  and  the  decoder  to  set  the 
carry  flag  was  outdated.  These  are  the  types  of  errors  that  a  simulation  can  point  out  before  the 
overall operation is checked to see if the ALU is actually performing the desired operations.

5.5  VHDL  Model 

A structural model of the FPASP was written in VHDL. This will form the basis of a behavioral
model to be written in the future. The tree structure of the model is shown in Figure 5.4.
The  model  will  allow  microcode  to  be  simulated  in  software  before  it  is  written  into  the  FPASP 
LPROMs.  Testing  the  microcode  is  a  stage  of  the  FPASP  implementation  cycle  shown  in  Figure 
5.5. 

After  a  working  FPASP  has  been  fabricated,  copies  of  it  can  be  tested  and  then  put  on  the 
shelf.  When  a  user  has  an  application  for  it,  that  application  is  reduced  to  an  algorithm.  The 
algorithm  is  then  converted  to  microcode  and  simulated  using  the  VHDL  model.  After  the  code 
has  been  verified,  the  FPASP  can  be  programmed.  The  programmed  FPASP  is  then  placed  on  a 
test  board  and  tested  again  before  releasing  it  to  the  user. 

Two tools are being developed at AFIT to extract structural VHDL models from the SIM
netlist files produced by the static circuit checkers. These tools are not yet completed, but the
FPASP should provide enough different cells for these programs to prove themselves on. The
STOVE tool is presently extracting such higher level circuits as master-slave flip-flops and
pseudo-NMOS NOR gates [Lin85].

Figure 5.4. VHDL Structural Components.

Figure 5.5. FPASP Prototyping Methodology.

The  other  tool  is  a  Prolog-language  extractor.  This  tool  was  used  at  the  start  of  this  thesis  to 
reverse  engineer  some  of  the  cells  in  the  ASP  library.  It  produces  a  netlist  of  gates  and  miscellaneous 
transistors.  The  circuit  logic  can  then  be  drawn  out  by  hand  [Duk88]. 

VI. Results


6.1  Introduction 

Figure 6.1 shows where this chapter concentrates in the hierarchy of processor design.
Most of the chapter covers microcode results from the EE588 class and routines written to support
those projects. It also gives some statistics on the FPASP chip and compares some of the FPASP's
architectural features with other 32 bit processor architectures.

The FPASP thesis effort took a total of 900 hours. Table 6.1 shows the approximate percentage
of time spent on various topics.


Figure  6.1.  Theme  Figure. 

Topic                            Percent Time
Architecture                     35%
VLSI Design                      35%
Microcode                        10%
Layout (up to this printing)     10%
Documentation for EE588 Class    8%
VHDL coding                      2%
TOTAL                            100%

Table 6.1. Areas of Effort


Macrocell                    Percent Area
Floating Point Multiplier    27%
Floating Point Adder         7%
Other Processing H/W         4%
ROMs                         11%
Registers                    16%
Control Section              2%
Bus channels                 11%
Padframe                     19%
Miscellaneous Hardware       3%
TOTAL                        100%

Table 6.2. FPASP Chip Area Allocations


6.2 VLSI Layout Statistics

The  FPASP  is  to  be  fabricated  in  a  1.2  micron  double-metal  CMOS  process.  The  final  package 
will  be  a  144  pin-grid-array.  Table  6.2  lists  the  percentage  of  the  total  die  area  to  be  taken  up  by 
each of the main sections of the FPASP. The die size is 315 mil by 380 mil, for a total of 119,700
mil^2. This is under the proposed limit of (350 mil)^2. The areas are based on macrocells already
laid out and the estimates used for the floorplan. The projected transistor count for the FPASP is
around  186,000.  The  multiplier  alone  has  over  75,000  transistors.  These  are  mostly  in  the  array  of 
1364  full  adders  and  327  half-adders  which  calculate  the  mantissa. 
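
The area figures can be cross-checked with a few lines of arithmetic. The sketch below uses the die dimensions and the Table 6.2 percentages quoted in this chapter; the per-macrocell areas it prints are derived estimates, not numbers given in the text.

```python
DIE_WIDTH_MIL = 315
DIE_HEIGHT_MIL = 380
LIMIT_MIL2 = 350 ** 2  # proposed limit of (350 mil)^2

AREA_PERCENT = {
    "Floating Point Multiplier": 27,
    "Floating Point Adder": 7,
    "Other Processing H/W": 4,
    "ROMs": 11,
    "Registers": 16,
    "Control Section": 2,
    "Bus channels": 11,
    "Padframe": 19,
    "Miscellaneous Hardware": 3,
}

die_area = DIE_WIDTH_MIL * DIE_HEIGHT_MIL


def macrocell_area_mil2(name):
    """Estimated area of one macrocell in square mils."""
    return die_area * AREA_PERCENT[name] / 100


print(f"die area: {die_area} mil^2 (limit {LIMIT_MIL2} mil^2)")
for name in AREA_PERCENT:
    print(f"{name}: {macrocell_area_mil2(name):.0f} mil^2")
```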


6.3  Microcode  Projects  Summary 

6.3.1 Support Routines. The routines summarized in Table 6.3 were written for the EE588
class to call in their routines. They are part of the library of routines to be included in the FPASP
XROM. They include basic routines for matrix algebra and the Newton-Raphson inversion routine
listed in Appendix B. Table 6.3 lists the number of lines, the number of cycles needed to run, and
the number of floating point operations performed (FLOPS).


Routine                  Lines   Cycles              FLOPS
Dot Product              8       2n+6                2n-1
Matrix X Vector          5       11m+2nm-3           2nm-m
Matrix X Matrix          5       2p+11pm+2pnm-1      2pnm-pm
  (for two matrices, m x n times n x p)
Vector Scaling           5       3n+4                n
  (for vectors of n elements)
Newton-Raphson Invert    15      31                  12

Table 6.3. Subroutine Statistics

The number of clock cycles depends on the size of the input matrix or vector. The matrix algebra
routines can operate on inputs of any size, limited only by the size of the external memory. The
matrix-matrix multiply routine calls the matrix-vector multiply routine, which calls dot-product.
The efficiency of each routine in terms of floating point hardware usage can be calculated as FLOPS
per clock cycle. Using this metric, the efficiency of the dot-product routine approaches 100% as the
number of elements in the vectors increases. The Newton-Raphson routine runs for four iterations,
which is enough for the routine to converge on the final result. Its floating point efficiency is
12/31, or 39%.
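
The efficiency metric is easy to evaluate. The sketch below takes the dot-product counts of 2n+6 cycles and 2n-1 floating point operations for vectors of n elements, and the 31 cycles and 12 operations of the Newton-Raphson routine, from Table 6.3. The function names are illustrative.

```python
def dot_product_cycles(n):
    """Clock cycles for a dot product of two n-element vectors."""
    return 2 * n + 6


def dot_product_flops(n):
    """Floating point operations: n multiplies and n-1 adds."""
    return 2 * n - 1


def efficiency(flops, cycles):
    """Floating point operations per clock cycle."""
    return flops / cycles


for n in (4, 16, 64, 1000):
    e = efficiency(dot_product_flops(n), dot_product_cycles(n))
    print(f"n = {n:4d}: {e:.1%}")

print(f"Newton-Raphson invert: {efficiency(12, 31):.1%}")
```

The fixed overhead of 6 cycles dominates for short vectors and becomes negligible for long ones, which is why the dot-product efficiency approaches 100%.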

6.3.2 EE588 Projects Summary. This section summarizes the results of the EE588 microcode
projects [Lin88]. Table 6.4 lists the algorithms implemented by these projects, the amount
of  code  each  one  took  up  (excluding  the  routines  listed  in  the  previous  section),  and  how  efficiently 
they  used  the  floating  point  hardware. 

The  sizing  results  show  clearly  that  the  FPASP  ROMs  are  large  enough  for  complex  algorithms 
such  as  Kalman  filtering  or  neural  net  back-propagation.  The  ROMs  can  hold  these  routines,  the 
subroutines  they  call,  and  a  self-test  routine.  These  routines  were  written  by  students,  many  of 
whom  had  not  written  microcode  before.  Therefore,  these  results  do  not  represent  the  ultimate 
performance capabilities of the FPASP. A better measure would be the routines listed in Table
6.3. For example, the efficiency of the Kalman filtering routine can be increased by correcting a
programming  flaw. 

The efficiency of floating point hardware use is an important result, since those pieces of
hardware have been given the most area on the chip. If the efficiency was always low it would
indicate that the FPASP architecture was not refined enough for its intended application. The
most expensive hardware should be getting as much use as possible, otherwise those functions
might be better off done by a separate device.

Project
  Kalman Filter
  LMS Adaptive Filter
  Log Base 2
  Coordinate Transformation
    rectangular to spherical
    spherical to rectangular
  Arctangent
  Singular Value Decomposition
  Newton-Raphson Square Root

MFLOPS/sec
  8.8

Table 6.4. 588 Project Summary

Low  efficiency  could  have  been  due  to  many  causes.  The  architecture  may  have  lacked  the 
flexibility  or  processing  elements  needed  to  keep  the  floating  point  hardware  supplied  with  data.  The 
algorithm  may  not  have  required  intense  floating  point  calculations,  in  which  case  the  FPASP  would 
be  the  wrong  processor  to  use.  The  architecture  may  have  lacked  the  support  needed  for  efficient 
microcode  programming  techniques,  or  the  support  may  not  have  been  used  by  the  programmer. 

The latter case can be seen in some of the EE588 projects, where this was the first exposure to
microcode  for  most  of  those  students.  The  FPASP  is  a  complex  machine  to  be  learning  microcode 
on,  and  the  version  they  used  was  not  the  final  version,  since  one  of  the  purposes  of  the  projects 
was  to  refine  the  design. 

As  it  turned  out,  the  FPASP  met  the  objectives  of  being  generic  enough  for  a  variety  of 
applications,  and  flexible  enough  to  perform  them  efficiently.  These  results  indicate  which  types 
of algorithms are best for the FPASP. The ones which use the features of the FPASP architecture
for  indexing  into  matrices  and  transferring  the  data  into  and  out  of  the  floating  point  hardware 
with  a  minimum  of  contention  for  the  D  bus  are  the  best.  The  routines  which  do  mostly  integer 
manipulations  to  set  up  a  floating  point  operation  are  the  poorest  ones  by  this  efficiency  metric. 

A self test routine was also written. It used 370 lines of code. This reflects the problem with
self-test:  a  large  subset  of  the  possible  commands  must  be  run  through  to  fully  test  the  machine. 
The  test  code  was  written  to  detect  stuck-at  faults,  and  assumes  the  hardware  will  function  properly 
if  fabricated  properly. 

The following section lists some design changes made, based on the results of the EE588
projects.  Application  of  some  of  these  changes  to  the  self-test  code  could  reduce  it  by  90  lines  of 
code.  This  frees  up  more  XROM  storage  for  the  library  of  common  subroutines. 

6.3.3  Result  of  Project  Suggestions.  As  part  of  the  project  write-ups,  the  students  were 
asked  to  suggest  modifications  which  would  make  their  code  more  efficient.  Their  suggestions  are 
listed  in  Table  6.5.  The  bottom  of  the  table  lists  the  ones  which  could  be  added  to  the  FPASP 
architecture  without  major  modifications.  Of  the  ones  not  added,  one  has  been  implemented  in  a 
round-about  way.  The  suggestion  to  have  the  register  selections  indexed  in  addition  to  being  directly 
selectable  from  the  ROM  can  be  done  using  the  hardware  added  for  the  assembly  language. 

As a result of adding that hardware, the register select fields can be overwritten from the R1
registers using the RSEL command. The user can format the data in the R1s to select any set of
registers and bus ties desired. The register selections can be incremented or decremented using the
ALUs, as long as an operation in one field does not overflow into the next one.
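
The overflow caveat can be made concrete with a small model of packed select fields. The sketch below is an illustration only: the 5-bit field width, the field ordering and the function names are assumptions, since the actual layout of the data used by RSEL is not given here.

```python
FIELD_BITS = 5  # assumed width of one register select field
FIELD_MASK = (1 << FIELD_BITS) - 1


def pack(fields):
    """Pack select fields into one word, field 0 in the lowest bits."""
    word = 0
    for index, value in enumerate(fields):
        word |= (value & FIELD_MASK) << (index * FIELD_BITS)
    return word


def unpack(word, count):
    """Split a word back into its select fields."""
    return [(word >> (i * FIELD_BITS)) & FIELD_MASK for i in range(count)]


def step_field(word, index, step=1):
    """Add step to one field with a single add, as an ALU would.

    Raises OverflowError if the result would spill into a neighbor.
    """
    shift = index * FIELD_BITS
    new_value = ((word >> shift) & FIELD_MASK) + step
    if not 0 <= new_value <= FIELD_MASK:
        raise OverflowError(f"field {index} would overflow")
    return word + (step << shift)


word = pack([3, 7, 31])
print(unpack(step_field(word, 0), 3))      # field 0 incremented
print(unpack(step_field(word, 1, -1), 3))  # field 1 decremented
```

Stepping field 2 upward from 31 is rejected by the check; without it, the plain add would carry out of that field and corrupt whatever lies above it.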

The  other  suggestions  require  too  many  control  bits.  The  registers  for  swapping  could  easily 
be  built  in  pairs  using  the  pointer  registers,  but  there  are  no  control  bits  left  free  to  control  the 
swapping.  The  control  would  have  to  be  stuck  into  one  of  the  unused  choices  in  an  existing  field, 
which  would  mean  it  could  not  be  done  concurrently  with  the  other  instructions  in  that  field. 

6.4 Architectural Comparisons

The FPASP architecture shares features with other processors. How these features affect the
software is seen in the code written for the FPASP as well as in the literature. The most common
shared feature is the pipelined microsequencer control and pipelined floating point hardware.

For  example,  the  Motorola  MC88100  has  a  pair  of  floating  point  pipelines  [Mot88].  The 
multiplier  is  six  levels  deep,  whereas  the  FPASP  has  only  two  levels.  The  tradeoff  made  here 
is  speed  versus  area.  The  Motorola  chip  takes  longer  to  do  a  multiply,  but  has  less  redundant 
hardware.  The  FPASP  multiplier  uses  redundant  hardware  to  produce  a  result  every  two  clock 
cycles. 

The result is that the FPASP process is much simpler to keep track of, both in the microcode and
in the hardware. The MC88100 requires extra registers to keep track of the state of the machine
at each level of the pipeline. This is in case the contents are invalidated by a subsequent operation
and the pipeline must be flushed.

Project                         Suggested Improvements
Kalman Filter                   literal insert on both datapaths
                                two registers load at once from C bus
                                C bus tie
                                registers load from A and B busses
Neural Nets                     literal insert on both datapaths
LMS adaptive filter             indirect register selection
                                incrementable MAR
                                separate literal and next address fields
Log Base 2                      none
Coordinate Transformation       more registers load from D bus
                                register swaps
Singular Value Decomposition    literal insert on both datapaths
                                incrementable MAR
Self Test                       multiple registers load from C bus
                                indirect register selection
                                register swaps

Changes made to FPASP           literal insert on both datapaths
                                incrementable MAR
                                three registers load from D bus
                                C bus tie

Table 6.5. Changes Suggested by 588 Students

In  the  FPASP  there  is  only  one  stop  in  the  pipeline,  but  two  clock  cycles  are  used  to  let 
the  results  settle.  The  multiplier  has  no  internal  registers  to  reset.  If  a  multiply  in  progress  is 
invalidated  there  is  no  penalty  for  flushing  the  pipeline.  The  next  operation  can  be  loaded  before 
the  invalid  one  is  done  and  the  hardware  will  still  settle  on  the  correct  result  after  two  clock  cycles. 

The  tradeoff  of  hardware  for  speed  in  the  FPASP  is  not  made  in  the  MC88100  because  of 
its  more  general  purpose  intentions.  Also,  the  two  cycle  delay  for  the  FPASP  multiplier  can  be 
taken  care  of  in  its  microcode.  This  shows  where  the  two  architectures  are  fundamentally  different 
despite  some  shared  features. 


6.4.1  No  ‘RISC’  Involved.  The  MC88100  is  a  ‘RISC’  (Reduced  Instruction  Set  Computer) 
type  of  processor  which  uses  simple  commands  that  must  be  decoded  in  a  single  clock  cycle.  It 
does  not  have  microcoded  routines  to  keep  track  of  the  multiplier’s  latency.  Each  command  must 
either  be  carried  out  immediately  or  pushed  into  a  pipeline. 

Although  the  FPASP  has  a  ‘reduced’  set  of  instructions,  they  are  reduced  only  in  the  sense 
that  they  represent  a  lower  level  of  hardware.  The  number  of  possible  combinations  is  immense, 
and the routines written in microcode are basically single, very complex instructions.

Even if the FPASP was only running programs written in its assembly language it would not
qualify as a ‘RISC’ despite the fact that there are only 30 assembly ‘instructions’. H. M. Sprunt
has proposed six criteria for a machine to qualify as a ‘RISC’ [Mit86]; they are listed in Table 6.6.


1. Single-cycle operation. This facilitates the rapid
execution of simple functions which predominate a computer’s
instruction stream and it promotes a low interpretive overhead.

2. Load/store design. Follows from a desire for
single-cycle operation.

3. Hardwired control. For the fastest possible single-cycle
operation. Microcode leads to slower control paths and adds to
interpretive overhead.

4. Relatively few instructions and addressing modes. This
facilitates a fast, simple interpretation by the control engine.

5. Simple instruction format. The consistent use of a simple
format eases the hardwired decoding of instructions, which again
speeds control paths.

6. More compile time effort. RISC machines are predicated on
running only compiled code. This offers an opportunity explicitly
to move static runtime complexity into the compiler.


Table  6.6.  Suggested  RISC  Criteria  [Mit86]. 

The  FPASP  certainly  meets  the  last  three  of  these  criteria,  but  the  fact  that  it  is  microcode 
driven takes it out of the realm of ‘RISC’ processors. This puts it with the ‘CISC’ processors
(Complex  Instruction  Set  Computers). 

Although  ‘complex’  usually  means  many  instructions,  the  FPASP  has  only  30.  But  over  half 
of  these  can  be  programmed  by  the  user,  making  them  as  complex  as  needed  to  simplify  the  code 
for  their  routine.  So  a  single  complex  FPASP  instruction  could  do  a  multiply-and-accumulate,  or 
memory  management  chore  for  an  operating  system. 

The pre-defined instructions can be considered complex also, since they can be used to implement
nearly all of the microcode instructions. They represent a complex set of possible instructions.

VII. Conclusions and Recommendations


7.1  Conclusions 

The  FPASP  research  effort  has  pointed  out  one  of  the  strengths  of  the  AFIT  VLSI  program. 
This is the use of ongoing research for class projects. The researcher gets additional data and
the  students  get  to  work  on  a  real  project  instead  of  a  dry  textbook  example.  This  also  provides 
continuity,  since  some  of  those  students  will  be  picking  up  this  research  next  year.  The  researcher 
also  gets  experience  producing  documentation  suitable  for  students  who  are  unfamiliar  with  the 
research.  The  quality  of  the  results  and  the  experience  gained  by  both  sides  benefit  everyone. 

Projects  with  the  scope  of  the  FPASP  require  this  type  of  synergism  in  order  to  produce 
meaningful  research.  If  this  research  had  been  attempted  without  the  class  project  the  result 
would  surely  have  been  less  satisfactory.  This  conclusion  is  even  more  meaningful  now  that  the 
next  class  of  students  have  selected  their  theses,  and  the  FPASP  is  being  pursued  for  two  of  them. 

The  FPASP  presents  a  new  rapid  prototyping  methodology,  but  this  in  no  way  invalidates 
the  one  proposed  by  Capt.  Gallagher.  In  fact,  since  most  of  the  cells  in  the  FPASP  have  been 
designed  to  the  same  bus  pitch  as  those  already  in  the  ASP  library,  they  can  also  be  drawn  on  for 
prototyping  those  types  of  ASPs.  The  new  cells  extend  the  scope  of  that  library  into  the  double 
precision  and  laser-programming  arenas. 

The  dual  32-bit  processor  architecture  of  the  FPASP  is  a  good  way  to  incorporate  double 
precision  hardware  into  a  machine  which  must  also  perform  integer  operations.  The  32-bit
datapaths  match  those  of  other  processors  in  common  use.  This  will  make  interfacing  to  other  processors
simpler,  whether  indirectly  through  a  memory  bank  or  directly  through  a  bus.  Any  other  scheme
for  splitting  a  group  of  64  bits  into  a  set  of  smaller  groups  would  either  be  asymmetric,  and
therefore  harder  to  lay  out,  or  it  would  require  more  separate  datapaths,  none  of  which  would  be  a
convenient  number  of  bits  wide.  More  separate  datapaths  would  also  require  more  pins  if  each  were 
to  have  its  own  memory  bank  and  address  bus. 
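
The split of a 64-bit double into an upper and a lower 32-bit word can be illustrated with a short sketch. This is only a software illustration, not part of the FPASP tools; the function names are hypothetical, and it assumes the upper word carries the sign, exponent and most significant mantissa bits, as the register descriptions in Appendix A suggest.

```python
import struct

def split_double(value):
    """Return (upper, lower) 32-bit words of an IEEE 754 double.

    Assumption: 'upper' holds the sign, the 11 exponent bits and the
    20 most significant mantissa bits; 'lower' holds the remaining 32.
    """
    upper, lower = struct.unpack(">II", struct.pack(">d", value))
    return upper, lower

def join_double(upper, lower):
    """Rebuild the double from its two 32-bit halves."""
    return struct.unpack(">d", struct.pack(">II", upper, lower))[0]

upper, lower = split_double(2.0)
print(f"{upper:032b} / {lower:032b}")
```

Each half fits a conventional 32-bit integer datapath, which is the property the paragraph above relies on.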

The  choice  of  six  incrementable  registers  and  four  pointer  registers  is  enough  for  most
applications  involving  matrix  algebra.  Nested  loops  did  not  go  beyond  three,  so  incrementers  were
available  for  loops  both  in  the  subroutines  and  in  the  main  routines.  Most  of  the  matrix  algebra 
required  only  three  matrices;  two  sources  and  one  result. 

The  depth  of  the  external  stack  was  not  reached  by  any  of  the  routines  written  by  the  EE588 
students.  The  deepest  use  was  53,  but  this  was  for  a  routine  that  only  called  itself.  If  the  routine 
called  other  subroutines  as  well  as  calling  itself,  the  deeper  stack  would  be  more  critical. 


7.2  Recommendations

In  keeping  with  the  conclusions  made  above,  a  possible  class  project  presents  itself  for  the 
FPASP  VHDL  model.  The  existing  model  is  only  a  frame  for  organizing  the  behavioral  descriptions 
of  the  components.  This  makes  it  ideal  for  a  class  project.  All  of  the  groups  will  be  working  on 
a  common  machine,  but  there  are  enough  components  to  make  each  project  worthwhile.  The 
structural  frame  provides  the  means  to  define  the  scope  of  each  group  project,  and  it  specifies  all 
of  the  external  signals. 

The  FPASP  should  be  designed  with  LPROM  cells  which  use  metal  links  for  programming
rather  than  diffusion  links.  The  laser  programming  hardware  presently  available  is  already  capable 
of  cutting  metal  lines  with  the  required  accuracy.  Forming  links  in  diffusion  apparently  requires 
more  precise  control  of  the  laser  pulse’s  frequency  and  duration  than  cutting  metal  links  does. 
The  fusible  metal  link  requires  about  the  same  amount  of  space  as  the  diffusion  link,  so  there  is
no  reason  to  rely  on  the  riskier  diffusion  process  at  this  time.  All  of  the  software  and  hardware 
available  now  can  be  used  for  either  process  [Til88]. 

The  previous  ASP  prototyping  method  proposed  by  Capt.  Gallagher  should  be  retained.  It 
could  even  be  used  for  classes  in  advanced  VLSI  design,  where  the  students  are  given  an  application 
and  given  this  library  of  cells  to  implement  it. 

The  core  processor  of  the  FPASP  could  also  serve  as  a  nucleus  for  laser  programmable  ASPs
in  other  areas  with  computationally  intense  applications.  There  are  many  areas  where  all  of  the 
necessary  hardware  could  be  integrated  onto  the  same  chip  as  the  processor.  These  would  re-use 
the  FPASP  layout  and  merely  replace  the  floating  point  macrocells  with  ones  more  suitable  for  the 
area  in  question. 

For  example,  the  multiplier  and  adder  could  be  replaced  with  macrocells  for  manipulating 
data  in  byte-size  chunks,  along  with  error  detection  and  correction  circuitry.  This  would  make  the 
new  ASP  ideal  for  a  variety  of  communications  applications.  Or  the  floating  point  hardware  could 
be  sacrificed  for  slower  but  smaller  versions.  The  space  saved  could  then  be  used  for  hardware  to 
interconnect  a  large  array  of  ASPs  much  like  the  INMOS  ‘Transputer’.  The  possibilities  are  almost 
endless  once  the  basic  core  is  designed. 

One  recommendation  for  the  FPASP  in  particular  is  to  send  out  some  parts  for  fabrication 
as  MOSIS  tinychips  before  the  final  chip  is  fabricated.  This  has  already  been  done  in  the  case  of 
the  LPROM  and  the  hardware  on  the  ASP  chip.  The  pads  are  derived  from  the  VVFT  chips,
so  those  can  also  be  tested. 
Some  circuits  which  would  benefit  from  more  testing  are  the  ALU/Shifters  and  two  types  of
incrementable  registers.  These  represent  two  of  the  longest  delay  paths  and  should  be  fully  tested. 
In  the  case  of  the  ALU,  a  shortened  version  like  the  one  used  for  ESIM  could  be  fit  on  a  forty-pin 
tinychip.  This  would  provide  enough  pins  for  all  the  controls  (8),  flags  (5)  and  data  I/O  (12),  with 
others  left  over  for  observability  of  the  control  decoders. 

The  programmable  assembly  language  idea  could  be  pursued  further.  If  the  mapping  of  the
R1  registers  to  the  macromuxes  on  the  control  pipeline  was  partially  laser-programmable,  the  user 
could  define  instruction  formats  as  well  as  the  opcode  and  microcode  routine.  This  could  be  useful 
if  a  compiler  were  written  for  the  FPASP.  The  ‘specific  application’  would  then  be  support  for  the 
compiler. 
Appendix  A.  Microword  Format


FPASP  Architecture  Specification
Microword  Format 
rev  3.0:  20  October  1988 


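[Figure: microword format. Field names and widths below are taken from
the figure and from the definitions in Section A.1.]

Upper ROM  (64 Bits)

    bits     field                          width
    0-4      A BUS SEL                      (5)
    5-9      B BUS SEL                      (5)
    10-14    C BUS SEL                      (5)
    15-18    ALU CTRL                       (4)
    19-22    SHIFT CTRL                     (4)
    23-25    FPMULT CTRL                    (3)
    26-28    FPADD CTRL                     (3)
    29-31    INC CTRL                       (3)
    32-34    AB PTR CTRL                    (3)
    35-36    MBR CTRL                       (2)
    37-38    MAR CTRL                       (2)
    39       E BUS TIE                      (1)
    40-41    E BUS SEL                      (2)
    42       MEMORY CHIP SELECT BAR         (1)
    43       MEMORY WRITE ENABLE BAR        (1)
    44       MEMORY OUTPUT ENABLE BAR       (1)
    45-47    FUNCTION ROM SELECT            (3)
    48       C BUS TIE                      (1)
    49-52    LIT INS CTRL                   (4)
    53-58    MUX SELECT                     (6)
    59-60    BRANCH CTRL                    (2)
    61-63    MACRO SEL                      (3)

Lower ROM  (64 Bits)

    bits     field                          width
    0-4      A BUS SEL                      (5)
    5-9      B BUS SEL                      (5)
    10-14    C BUS SEL                      (5)
    15-18    ALU CTRL                       (4)
    19-22    SHIFT CTRL                     (4)
    23-24    BARREL SET-UP                  (2)
    25-29    BARREL SHIFT CTRL              (5)
    30-32    INC CTRL                       (3)
    33-35    CD PTR CTRL                    (3)
    36-37    MBR CTRL                       (2)
    38-39    MAR CTRL                       (2)
    40-41    E BUS SEL                      (2)
    42       A BUS TIE                      (1)
    43       B BUS TIE                      (1)
    44       MEMORY CHIP SELECT BAR         (1)
    45       MEMORY WRITE ENABLE BAR        (1)
    46       MEMORY OUTPUT ENABLE BAR       (1)
    47       DONE FLAG                      (1)
    48-63    NEXT ADDRESS/LITERAL           (16)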
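
The field layout can be exercised in software with a small packing and unpacking sketch. This is a hypothetical helper, not part of the FPASP microcode tools: the field names are invented for illustration, the widths are those given for the Upper ROM in Section A.1, and the code assumes that bit 0 is the most significant bit of the 64-bit word, which the specification does not state explicitly.

```python
# (name, width) pairs for the Upper ROM, in bit order 0..63 (names are hypothetical).
UPPER_FIELDS = [
    ("a_bus_sel", 5), ("b_bus_sel", 5), ("c_bus_sel", 5), ("alu", 4),
    ("shift", 4), ("fp_mult", 3), ("fp_add", 3), ("inc", 3), ("ab_ptr", 3),
    ("mbr", 2), ("mar", 2), ("e_tie", 1), ("e_bus_sel", 2), ("cs_bar", 1),
    ("we_bar", 1), ("oe_bar", 1), ("func_rom", 3), ("c_tie", 1),
    ("lit_ins", 4), ("cond_mux", 6), ("branch", 2), ("macro_mux", 3),
]
assert sum(width for _, width in UPPER_FIELDS) == 64

def encode(values, fields=UPPER_FIELDS):
    """Pack a dict of field values into one 64-bit word; missing fields are 0."""
    word = 0
    for name, width in fields:
        word = (word << width) | (values.get(name, 0) & ((1 << width) - 1))
    return word

def decode(word, fields=UPPER_FIELDS):
    """Unpack a 64-bit word into a dict of field values (bit 0 assumed to be the MSB)."""
    out = {}
    pos = 64
    for name, width in fields:
        pos -= width
        out[name] = (word >> pos) & ((1 << width) - 1)
    return out

word = encode({"a_bus_sel": 0b00001, "we_bar": 1, "branch": 0b10})
print(decode(word)["a_bus_sel"], decode(word)["branch"])
```

The width check at the top is a cheap guard against a field being dropped or mis-sized when the table is edited.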

A.l  Definitions  of  Microcode  Fields 


Revision 3.0   10 November 1988


Upper ROM Fields - 64 bits

bits 0-4  Upper A Bus Select

    00000:  NOP1    (no drive) (DEFAULT)
    00001:  AU=R1   (upper A bus = upper R1)
    00010:  AU=R2
      |
    11001:  AU=R25
    11010:  AU=IN1  (upper A bus = upper INC1)
    11011:  AU=IN2
    11100:  AU=IN3
    11101:  AU=APT  (upper A bus = A pointer)
    11110:  AU=BPT
    11111:  AU=MBR  (upper A bus = upper MBR)

bits 5-9  Upper B Bus Select

    00000:  NOP2
    00001:  BU=R1
    00010:  BU=R2
      |
    11001:  BU=R25
    11010:  BU=IN1
    11011:  BU=IN2
    11100:  BU=IN3
    11101:  BU=FP*  (upper B = result of FP MULTIPLIER, MSB's)
    11110:  BU=FP+  (upper B = result of FP ADDER, MSB's)
    11111:  BU=MBR

bits 10-14  Upper C Bus Select

    00000:  NOP3    (no load) (DEFAULT)
    00001:  R1=CU   (upper R1 loads from upper C bus)
    00010:  R2=CU
      |
    11001:  R25=CU
    11010:  IN1=CU  (upper INC1 loads from upper C bus)
    11011:  IN2=CU
    11100:  IN3=CU
    11101:  APT=CU  (A pointer loads from upper C bus)
    11110:  BPT=CU
    11111:  MBR=CU  (upper MBR loads from upper C bus)

bits 15-18  Upper ALU Select

    0000:  NOP4
    0000:  MOVUN  (SHIFTER_INPUT = A, flags unaffected) (DEFAULT)
    0001:  ORU    (SHIFTER_INPUT = A or B)
    0010:  ANDU   (SHIFTER_INPUT = A and B)
    0011:  XORU   (SHIFTER_INPUT = A xor B)
    0100:  MOVU   (SHIFTER_INPUT = A, affects flags)
    0101:  NANDU  (SHIFTER_INPUT = A nand B)
    0110:  NORU   (SHIFTER_INPUT = A nor B)
    0111:  NOTU   (SHIFTER_INPUT = A')
    1000:  INCU   (SHIFTER_INPUT = A + 1)
    1001:  SETU   (SET CARRY FLAG FLIP-FLOP)
    1010:  ADCU   (SHIFTER_INPUT = A + B + cy flip-flop)
    1011:  ADDU   (SHIFTER_INPUT = A + B)
    1100:  NEGAU  (SHIFTER_INPUT = -A)
    1101:  SUBU   (SHIFTER_INPUT = A - B)
    1110:  SWBU   (SHIFTER_INPUT = A - B - cy flip-flop)
    1111:  DECU   (SHIFTER_INPUT = A - 1)


bits 19-22  Upper Shift Control (INPUTS COME DIRECTLY FROM ALU)

    0000:  NOP5   (SHIFTER DOES NOT DRIVE C BUS) (DEFAULT)
    0001:  GNDCU  (C = 0, short C bus to GND, shift flags unaffected)
    0010:  PASSU  (C = SHIFTER INPUT, shift flags unaffected)
    0011:  SLOTU  (SHL with SH_OUT into LSB)
    0100:  SLMSU  (circulate left with MSB into LSB)
    0101:  SLCYU  (SHL with CY out of ALU into LSB)
    0110:  SL0U   (SHL with 0 into LSB)
    0111:  SL1U   (SHL with 1 into LSB)
    1000:  SRLSU  (circulate right with ALU LSB into MSB)
    1001:  SRCFU  (SHR with carry flip-flop into MSB)
    1010:  SRSU   (SHR with SIGN FF into MSB)
    1011:  SROTU  (SHR with SH_OUT FF into MSB)
    1100:  SRSEU  (SHR with sign extension = MSB into MSB)
    1101:  SRCYU  (SHR with CY out of ALU into MSB)
    1110:  SR0U   (SHR with 0 into MSB)
    1111:  SR1U   (SHR with 1 into MSB)


bits 23-25  Floating Point Multiplier Control

    000:  NOP6
    000:  FP*     (do floating point multiply) (DEFAULT)
    001:  FP*D    (drive results onto C bus, latch flags)
    010:  FP*L    (load A,B bus inputs)
    011:  FP*LD   (load A,B bus inputs, drive C bus, latch flags)
    100:  INT*    (do integer mult: this must be selected during INT*)
    101:  INT*D
    110:  INT*L   (load 32 bit integers from A,B busses)
    111:  INT*LD


bits 26-28  Floating Point Adder Control

    000:  NOP7
    000:  FP+     (do floating point add) (DEFAULT)
    001:  FP+D    (drive results onto C bus, latch flags)
    010:  FP+L    (load A,B bus inputs)
    011:  FP+LD   (load A,B bus inputs, drive C bus, latch flags)
    100:  FP-     (do fl pt subtr: this must be selected during subtr)
    101:  FP-D    (drive C bus while doing subtract)
    110:  FP-L    (load new inputs while doing subtract)
    111:  FP-LD   (load and drive while doing subtract)

bits 29-31  Upper Incrementable Registers Control

    000:  NOP8    (no increment) (DEFAULT)
    001:  UIN1+   (increment upper INC1)
    010:  UIN2+   (increment upper INC2)
    100:  UIN3+   (increment upper INC3)

bits 32-34  A,B Pointer Controls

    000:  NOP9    (no change) (DEFAULT)
    001:  AIN_L   (load A increment size)
    010:  BIN_L   (load B increment size)
    011:  ABIN_L  (load A,B increment sizes)
    100:  APT+    (increment A pointer)
    101:  A+_AL   (increment A pointer, load A increment)
    110:  B+_BL   (increment B pointer, load B increment)
    111:  BPT+    (increment B pointer)
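
As a rough software picture of how a pointer and its increment register step through a matrix, consider the sketch below. It is an illustration only: the class and function names are hypothetical, the row-major storage layout is an assumption, and the two methods merely stand in for the "load increment size" and "increment pointer" controls listed above.

```python
class StridePointer:
    """Hypothetical model of one pointer register plus its increment register."""

    def __init__(self, address=0, increment=0):
        self.address = address
        self.increment = increment

    def load_increment(self, size):
        # Stands in for the "load A increment size" control.
        self.increment = size

    def step(self):
        # Stands in for the "increment A pointer" control.
        self.address += self.increment

def column_addresses(base, columns, rows):
    """Addresses of one column of a row-major matrix with 'columns' words per row."""
    ptr = StridePointer(base)
    ptr.load_increment(columns)
    addresses = []
    for _ in range(rows):
        addresses.append(ptr.address)
        ptr.step()
    return addresses

print(column_addresses(100, 4, 3))
```

Loading an increment of 1 instead walks along a row, so the same two controls cover both access patterns used in matrix routines.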

bits 35-36  Upper Memory Buffer Registers Control

    00:  NOP10   (no action) (DEFAULT)
    01:  R1=DU   (Load upper R1 from upper D bus)
    10:  R2=DU   (Load upper R2 from upper D bus)
    11:  MB=DU   (Load upper MBR from upper D bus)

bits 37-38  Upper Memory Address Register Control

    00:  MARU    (upper MAR drives upper Address bus) (DEFAULT)
    01:  MAR=CU  (Drive upper addr bus & load from C bus)
    10:  MAR=EU  (Drive upper addr bus & load from E bus)
    11:  NMARU   (upper MAR does not drive upper addr bus,
                  Data and memory pads go to high impedance)

bit 39  E Bus Tie

    0:  NOP11   (no drive) (DEFAULT)
    1:  ETIE    (tie upper and lower E busses together)

bits 40-41  Upper E Bus Select

    00:  NOP12   (no drive) (DEFAULT)
    01:  E=APT   (drive A pointer onto upper E bus)
    10:  E=BPT   (drive B pointer onto upper E bus)
    11:  MARU+   (increment upper MAR)

bit 42  Upper Memory Chip Select Bar

    0:  NOP13   (active low) (DEFAULT VALUE: chip selected)
    1:  UCS_H

bit 43  Upper Memory Write Enable Bar

    0:  WEBU    (active low)
    1:  NOP14   (DEFAULT VALUE: write disabled, =read mem)

bit 44  Upper Memory Chip Output Enable Bar

    0:  NOP15   (active low) (DEFAULT VALUE: output enabled)
    1:  UOE_H

bits 45-47  Function ROM Select

    000:  NOP16  (ROM does not drive C bus) (DEFAULT)
    001:  SQR    (Square Root)
    010:  RCP    (Reciprocal)
    011:  user defined
    100:  user defined (LPROM)
    101:  user defined    "
    110:  user defined    "
    111:  user defined    "

bit 48  C Bus Tie

    0:  NOP17
    1:  CTIE


bits 49-52  Literal Inserter Control

    0000:  NOP18  (SHIFTER DOES NOT DRIVE C BUS) (DEFAULT)
    0001:  SAV1   (flags set1 to upper MSBs, LSBs no change)
    0010:  SAV2   (flags set2 to upper MSBs, LSBs zeroed)
    0011:  SAV3   (flags set3 to upper LSBs, MSBs no change)
    0100:  ILZL
    0101:  IMNL
    0110:  IMZL
    0111:  ILNL
    1000:  ILZU
    1001:  IMNU
    1010:  IMZU
    1011:  ILNU
    1100:  ILZB
    1101:  IMNB
    1110:  IMZB
    1111:  ILNB

    Mnemonic system (example: ILZL)

        1st character:  Insert literal
        2nd:            Lsb's or Msb's get literal
        3rd:            Zero other half or No change
        4th:            Upper/Lower/Both busses get literal

    So the mnemonic ILZL means Insert the literal on the Lsb's
    and Zero out the msb's of the Lower A bus only.
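
The four-character mnemonic scheme can be decoded mechanically. The sketch below is a hypothetical helper, not FPASP software. It assumes the third character is Z (zero the other half) or N (no change), a 16-bit literal and a 32-bit bus, and it models "no change" as keeping the other half of the current bus value; the precharged-bus behavior of the real hardware is not modeled.

```python
def describe_literal_mnemonic(mnemonic):
    """Split a literal-inserter mnemonic such as 'ILZL' into its three choices."""
    if len(mnemonic) != 4 or mnemonic[0] != "I":
        raise ValueError("expected a 4-character mnemonic starting with 'I'")
    half = {"L": "LSBs", "M": "MSBs"}[mnemonic[1]]
    other = {"Z": "zeroed", "N": "unchanged"}[mnemonic[2]]
    bus = {"U": "upper", "L": "lower", "B": "both"}[mnemonic[3]]
    return half, other, bus

def insert_literal(bus_value, literal, mnemonic):
    """Apply the insertion to one 32-bit bus value (simplified model)."""
    half, other, _ = describe_literal_mnemonic(mnemonic)
    literal &= 0xFFFF
    if half == "LSBs":
        kept = bus_value & 0xFFFF0000 if other == "unchanged" else 0
        return kept | literal
    kept = bus_value & 0x0000FFFF if other == "unchanged" else 0
    return kept | (literal << 16)

print(describe_literal_mnemonic("ILZL"))
print(hex(insert_literal(0x12345678, 0xFFF0, "IMZL")))
```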

bits 53-58  Conditional Multiplexer Select

    000000:  FAL     (unconditionally false) (DEFAULT)
    000001:  IO1     (I/O interrupt level 1)
    000010:  IO2     (I/O interrupt level 2)
    000011:  MZ      (multiplier zero)
    000100:  MOVF    (mult overflow/int result more than 32 bits)
    000101:  MUN     (multiplier underflow)
    000110:  MNAN    (multiplier NaN - Not a Number)
    000111:  MDEN    (multiplier denormalization trap)
    001000:  AZ      (adder zero)
    001001:  AOVF    (adder overflow)
    001010:  ANAN    (adder NaN - Not a Number)
    001011:  ADIF    (Inputs to adder differ by more than 2**63)
    001100:  TRPS    (MOVF + MNAN + AOVF + ANAN + UALUO)
    001101:  UIN1Z   (Upper INC1 = 0)
    001110:  UIN2Z   (Upper INC2 = 0)
    001111:  UIN3Z   (Upper INC3 = 0)
    010000:  USO     (Upper shifter's shifted out bit = 0)
    010001:  LIN1Z   (Lower INC1 = 0)
    010010:  LIN2Z   (Lower INC2 = 0)
    010011:  LIN3Z   (Lower INC3 = 0)
    010100:  UALUZ   (Upper ALU zero)
    010101:  UALUN   (Upper ALU negative)
    010110:  UALUO   (Upper ALU overflow)
    010111:  UALUC   (Upper ALU carry)
    011000:  unused
    011001:  unused
    011010:  unused
    011011:  UEVN    (Integer on upper C bus is even) (LSB = 0)
    011100:  UR1_0   (UR1 bit[0]=1) (LSB)
    011101:  UR1_1   (UR1 bit[1]=1)
    011110:  UR1_2   (UR1 bit[2]=1)
    011111:  UR1_3   (UR1 bit[3]=1)
    100000:  TRU     (Unconditionally true)
    100001:  NIO1    (not I/O interrupt level 1)
    100010:  NIO2    (not I/O interrupt level 2)
    100011:  NMZ     (not multiplier zero)
    100100:  NMOVF   (not multiplier overflow)
    100101:  NMUN    (not multiplier underflow)
    100110:  NMNAN   (not multiplier NaN)
    100111:  NMDEN   (not multiplier denormalization trap)
    101000:  NAZ     (not adder zero)
    101001:  NAOVF   (not adder overflow)
    101010:  NANAN   (not adder NaN)
    101011:  NADIF   (Inputs to adder do not differ by more than 2**63)
    101100:  NTRPS   (MOVF + MNAN + AOVF + ANAN + UALUO = 0)
    101101:  UIN1N   (Upper INC1 not = 0)
    101110:  UIN2N   (Upper INC2 not = 0)
    101111:  UIN3N   (Upper INC3 not = 0)
    110000:  LSO     (Lower shifter's shifted out bit = 0)
    110001:  LIN1N   (Lower INC1 not = 0)
    110010:  LIN2N   (Lower INC2 not = 0)
    110011:  LIN3N   (Lower INC3 not = 0)
    110100:  LALUZ   (Lower ALU zero)
    110101:  LALUN   (Lower ALU negative)
    110110:  LALUO   (Lower ALU overflow)
    110111:  LALUC   (Lower ALU carry)
    111000:  unused
    111001:  unused
    111010:  unused
    111011:  LEVN    (Integer on lower C bus is even) (LSB = 0)
    111100:  UR1_28  (UR1 bit[28]=1)
    111101:  UR1_29  (UR1 bit[29]=1)
    111110:  UR1_30  (UR1 bit[30]=1)
    111111:  UR1_31  (UR1 bit[31]=1) (MSB) (also used as immed addr flag)

bits 59-60  Branch Control (if condition not true, next adr = CAR+1)

    00:  BR    (conditional branch) (DEFAULT)
    01:  RET   (conditional return)
    10:  CALL  (conditional call)
    11:  MAP   (unconditionally use the MAP next address)
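
The next-address rule can be sketched in a few lines. This is a simplified, hypothetical model rather than the FPASP sequencer: it assumes that CALL saves CAR+1 on the subroutine stack, that RET pops it, and it ignores the one-instruction pipeline delay noted in the Appendix B listings. The fourth encoding is taken to select a separately supplied mapped address.

```python
BR, RET, CALL, MAP = 0b00, 0b01, 0b10, 0b11

def next_address(car, condition, branch_ctrl, target, map_address, stack):
    """Return the next control address for one microword (simplified model).

    car          current control address register value
    condition    output of the conditional multiplexer (True/False)
    branch_ctrl  2-bit Branch Control field
    target       Next Address field of the microword
    map_address  address supplied by the macro-instruction mapping (assumed)
    stack        Python list standing in for the subroutine stack
    """
    if branch_ctrl == MAP:
        return map_address          # unconditional
    if not condition:
        return car + 1              # condition false: fall through
    if branch_ctrl == BR:
        return target
    if branch_ctrl == CALL:
        stack.append(car + 1)       # assumed return address
        return target
    return stack.pop()              # RET

stack = []
print(next_address(10, True, CALL, 40, 0, stack), stack)
```

With the default field values (condition FAL, control BR) the model always falls through to CAR+1, which matches the idea that an all-default microword simply continues to the next line.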

bits 61-63  Macrocode Support Mux Selects

    000:  NOP19  (DEFAULT)
    001:  RSEL   (override register select/bus tie fields)
    010:  BRSEL  (override condition mux select field)
    011:  SRSEL  (override Ebus/ETie ctrls with address source sel)
    100:  ALSEL  (override ALU/Shift, reg sel, and bus tie fields)
    101:  SHSEL  (override Barrel shift ctrl, reg sels, and bus ties)
    110:  ISEL   (override increment and pointer ctrl fields)
    111:  unused


Lower ROM Fields - 64 bits


bits 0-4  Lower A Bus Select

    00000:  NOP20   (no drive) (DEFAULT)
    00001:  AL=R1   (lower A bus = lower R1)
    00010:  AL=R2
      |
    11001:  AL=R25
    11010:  AL=IN1  (lower A bus = lower INC1)
    11011:  AL=IN2
    11100:  AL=IN3
    11101:  AL=CPT  (lower A bus = C pointer)
    11110:  AL=DPT
    11111:  AL=MBR

bits 5-9  Lower B Bus Select

    00000:  NOP21
    00001:  BL=R1
    00010:  BL=R2
      |
    11001:  BL=R25
    11010:  BL=IN1
    11011:  BL=IN2
    11100:  BL=IN3
    11101:  BL=FP*  (lower B = result of FP MULTIPLIER, LSB's)
    11110:  BL=FP+  (lower B = result of FP ADDER, LSB's)
    11111:  BL=MBR

bits 10-14  Lower C Bus Select

    00000:  NOP22   (no load) (DEFAULT)
    00001:  R1=CL   (lower R1 loads from the lower C bus)
    00010:  R2=CL
      |
    11001:  R25=CL
    11010:  IN1=CL  (lower INC1 loads from the lower C bus)
    11011:  IN2=CL
    11100:  IN3=CL
    11101:  CPT=CL  (C pointer loads from the lower C bus)
    11110:  DPT=CL
    11111:  MBR=CL  (Lower MBR loads from lower C bus)

bits 15-18  Lower ALU Select

    0000:  NOP23
    0000:  MOVLN  (SHIFTER_INPUT = A, flags unaffected) (DEFAULT)
    0001:  ORL    (SHIFTER_INPUT = A or B)
    0010:  ANDL   (SHIFTER_INPUT = A and B)
    0011:  XORL   (SHIFTER_INPUT = A xor B)
    0100:  MOVL   (SHIFTER_INPUT = A, flags affected)
    0101:  NANDL  (SHIFTER_INPUT = A nand B)
    0110:  NORL   (SHIFTER_INPUT = A nor B)
    0111:  NOTL   (SHIFTER_INPUT = A')
    1000:  INCL   (SHIFTER_INPUT = A + 1)
    1001:  SETL   (set carry flag flip-flop)
    1010:  ADCL   (SHIFTER_INPUT = A + B + cy flip-flop)
    1011:  ADDL   (SHIFTER_INPUT = A + B)
    1100:  NEGAL  (SHIFTER_INPUT = -A)
    1101:  SUBL   (SHIFTER_INPUT = A - B)
    1110:  SWBL   (SHIFTER_INPUT = A - B - cy flip-flop)
    1111:  DECL   (SHIFTER_INPUT = A - 1)

bits 19-22  Lower Shift Control (INPUTS COME DIRECTLY FROM ALU)

    0000:  NOP24  (SHIFTER DOES NOT DRIVE C BUS) (DEFAULT)
    0001:  GNDCL  (C = 0, short C bus to GND, shift flags unaffected)
    0010:  PASSL  (C = SHIFTER INPUT, shift flags unaffected)
    0011:  SLOTL  (SHL with SH_OUT into LSB)
    0100:  SLMSL  (circulate left with MSB into LSB)
    0101:  SLCYL  (SHL with CY out of ALU into LSB)
    0110:  SL0L   (SHL with 0 into LSB)
    0111:  SL1L   (SHL with 1 into LSB)
    1000:  SRLSL  (circulate right with ALU LSB into MSB)
    1001:  SRCFL  (SHR with carry flip-flop into MSB)
    1010:  SRSL   (SHR with SIGN FF into MSB)
    1011:  SROTL  (SHR with SH_OUT FF into MSB)
    1100:  SRSEL  (SHR with sign extension = MSB into MSB)
    1101:  SRCYL  (SHR with CY out of ALU into MSB)
    1110:  SR0L   (SHR with 0 into MSB)
    1111:  SR1L   (SHR with 1 into MSB)

bits 23-24  Barrel Shifter Set-up

    00:  SHROM  (shift using ROM control input) (DEFAULT)
    01:  SHREG  (shift using control register input)
    10:  L_ROM  (Load Bar Ctrl reg, use ROM input to shift)
    11:  L_REG  (Load Bar Ctrl reg, use old reg value to shift)

    NOTE: If you just want to load the reg, use L_ROM and NOP25 below.

bits 25-29  Barrel Shifter Control

    00000:  NOP25  (Do not drive C bus) (DEFAULT)
    00001:  LCS1   (1 bit left circular shift)
    00010:  LCS2   (2 bit left circular shift)
      |
    11111:  LCS31  (31 bit left circular shift)
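
The effect of a left circular shift on a 32-bit word can be checked with a short sketch. The function name is hypothetical, and the 32-bit width is taken from the datapath width described in the conclusions.

```python
def left_circular_shift(value, count):
    """Rotate a 32-bit value left by 'count' bit positions (0..31)."""
    value &= 0xFFFFFFFF
    count %= 32
    # Bits shifted out of the MSB end re-enter at the LSB end.
    return ((value << count) | (value >> (32 - count))) & 0xFFFFFFFF

print(hex(left_circular_shift(0x80000001, 1)))
```

A rotate left by 31 is the same as a rotate right by 1, so the 31 encodings cover every circular shift distance in either direction.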

bits 30-32  Lower Incrementable Registers Control

    000:  NOP26   (no increment) (DEFAULT)
    001:  LIN1+   (increment lower INC1)
    010:  LIN2+   (increment lower INC2)
    100:  LIN3+   (increment lower INC3)

bits 33-35  C,D Pointer Controls

    000:  NOP27   (no change) (DEFAULT)
    001:  CIN_L   (load C increment size)
    010:  DIN_L   (load D increment size)
    011:  CDIN_L  (load C,D increment sizes)
    100:  CPT+    (increment C pointer)
    101:  C+_CL   (increment C pointer, load C increment)
    110:  D+_DL   (increment D pointer, load D increment)
    111:  DPT+    (increment D pointer)

bits 36-37  Lower Memory Buffer Registers Control

    00:  NOP28   (no action) (DEFAULT)
    01:  R1=DL   (Load lower R1 from lower D bus)
    10:  R2=DL   (Load lower R2 from lower D bus)
    11:  MB=DL   (Load lower MBR from lower D bus)

bits 38-39  Lower Memory Address Register Control

    00:  MARL    (lower MAR drives lower address bus) (DEFAULT)
    01:  MAR=CL  (drive lower addr bus and load MAR from lower C bus)
    10:  MAR=EL  (drive lower addr bus and load MAR from lower E bus)
    11:  NMARL   (lower MAR does not drive lower address bus,
                  and address pads go to high impedance)

bits 40-41  Lower E Bus Select

    00:  NOP29   (no drive) (DEFAULT)
    01:  E=CPT   (drive C pointer onto lower E bus)
    10:  E=DPT   (drive D pointer onto lower E bus)
    11:  MARL+   (increment lower MAR)

bit 42  A Bus Tie

    0:  NOP30
    1:  ATIE    (tie the upper and lower A busses together)

bit 43  B Bus Tie

    0:  NOP31
    1:  BTIE    (tie the upper and lower B busses together)

bit 44  Lower Memory Chip Select Bar

    0:  NOP33   (active low) (DEFAULT VALUE: chip selected)
    1:  LCS_H

bit 45  Lower Memory Write Enable Bar

    0:  WEBL    (active low)
    1:  NOP34   (DEFAULT VALUE: write disabled, =read mem)

bit 46  Lower Memory Output Enable Bar

    0:  NOP35   (active low) (DEFAULT VALUE: output enabled)
    1:  LOE_H

bit 47  DONE FLAG

    0:  NOP36   (no action) (DEFAULT)
    1:  DONE    (raise done flag)

bits 48-63  Next Address/Literal Field

    (Literals are composed of the bit pattern preceded by #)
Appendix  B.  Newton-Raphson  Inversion  Routine

Newton - Raphson Inversion ROUTINE  --  J CONTOIS

LOCATION OF PASSED PARAMETERS:
              UR25/LR25 : the number to be inverted
RETURNS:      result of inversion in UR23/LR23
CHANGES:      UR23 LR23 UR24 LR24 IN1L
              FP* FP+ registers
NOTES:        IN1L is the loop counter; it is
              set to -3 at the start, ends at 0
ANALYSIS:     uses 31 clock cycles
RESTRICTIONS:
HISTORY:      Updated to rev. 3.0 microcode.


IRI:    BU=R25  R23=CU  NANDU  PASSU  IMZL
                R23=CL  GNDCL  #1111111111110000;
;
;  This line of code does the following:
;  UR23<=(UR25 NAND MASK), LR23<=0
;  The number to be inverted is used to generate the seed for
;  this routine.  A mask is inserted onto the lower A bus, passed
;  to the upper A bus, and used to isolate the sign bit and the
;  exponent bits of the number to be inverted (A).  This exponent
;  e has a binary value of e+1023 in IEEE format.  This line of
;  code inverts that e as well as masking out the mantissa.
;  LR23 is cleared: it will be the LSB's of the seed's mantissa,
;  which are 0.  NOTE that the MSB's of the seed mantissa have
;  been set to 1 by NANDing them with 0.  This will be fixed
;  later by clearing these with the literal ins.


        BU=R23  R23=CU  ADDU  PASSU  IMZL
                #0111111111100000;

This line of code does the following:
UR23<=(UR23 + 1022)
The exponent must be negated and decremented by 1 to get the
exponent for the seed (-e-1).  This is done by inverting the
exponent and adding 1022.  The inversion was done by NAND in
the previous line, the addition of 1022 is done by this line
of code.  The result is -e-1 in IEEE format.  The sign bit
gets re-inverted back to its original value as a result of
the carry out of the exponent addition.
        BU=R25  R24=CU  RCP
                R24=CL  GNDCL;

This line of code does the following:
UR24<=seed from function ROM, LR24<=0
The function ROM is used to generate the 4 MSB's of the
mantissa of the seed (initial guess at the result).  It uses
the lower 4 bits of the upper byte of the upper B bus (which
correspond to the 4 MSB's in the mantissa of a floating
point number) to choose the seed.  LR24 is cleared, it will
become the LSB's of a floating point "2".


        AU=R23  BU=R24  R23=CU  ORU  PASSU  IMZL
                #1111111111110000;

This line of code does the following:
UR23 <= [ UR24 OR (UR23 AND MASK) ]
The seed's mantissa (UR24) is OR'd with the altered
exponent -e-1 (UR23) to form the total seed, in proper
IEEE floating point format.  Note how the literal inserter
has been used to clear the MSB's of the mantissa:
the sign/exponent/4 MSB's are driven onto the upper A bus
from UR23, then the literal inserter ALSO drives the upper
A bus.  The literal chosen is 1111111111110000/00...00
(all 0's in the LSB's by using IMZL).

What happens is an AND: the literal bits that
are 1 will not affect the bits driven from UR23, but the
bits that are 0 will override the bits from UR23 due to the
operation of a precharged bus.


        R24=CU  PASSU  IMZL
                #0100000000000000;

This line of code does the following:
UR24<=literal, a floating point 2
A floating point "2" is needed by the algorithm.  This
line loads the upper R24 with the proper sign, exponent
and MSB's.  The Lower R24 holds the rest of the mantissa,
which is all 0's.  An IEEE floating point "2" looks like this,
in UR24/LR24:

    0               positive sign
    10000000000     "+1" IEEE
    0000/00 - 00    mantissa = 52 0's.


        AU=R25  BU=R23  FP*L  TRU
        AL=R25  BL=R23  RLP2;

        FP*  ILNL
        IN1=CL  MOVL  PASSL  #1111111111111101;

These lines of code do the following:
load the multiplier with R25,R23, IN1L<=-3, jump into loop
on 3rd line.  These lines load the multiplier with A and X,
and do the multiply.  The loop is entered on the third line
because the first 2 lines load the "X" calculated in the
previous iteration.  For the first iteration "X" is the seed
created above.  The line with FP* gets done before the jump
occurs due to pipeline delay.  The lower incrementable register
is the loop counter, 4 loops are needed.

<<<<<  TOP OF THE LOOP  >>>>>

This loop does the following three calculations four times:

    AX          A is the number to be inverted.
    2-AX        X is the latest guess at 1/A.
    X(2-AX)     = new X for the next iteration.

Upon completion of the 4 iterations, X is the result: X=1/A.


RLP:    AU=R25  BU=FP*  FP*LD
        AL=R25  BL=FP*;

        FP*  TRPS  CALL
        TRAP;
;
;  These lines load the multiplier with A and the latest
;  version of X.  The line with FP* also checks the trap
;  conditions generated during the previous
;  floating point operation.


RLP2:   AU=R24  BU=FP*  FP*D  FP-L
        AL=R24  BL=FP*;

        FP-  TRPS  CALL
        TRAP;
;
;  These lines load the subtractor with 2 and AX from the
;  previous mult.  The line with FP- also checks the trap
;  conditions generated during the previous
;  floating point operation.


        AU=R23  BU=FP+  FP*L  LIN1N
        AL=R23  BL=FP+  RLP;

        FP*  TRPS  CALL
        LIN1+  TRAP;

These lines load the multiplier with X and the result of
the previous subtr.  The first line also checks to see if
the loop counter has not reached 0, in which case the
program will jump to the top of the loop.  The second line
is done before the jump, due to the pipeline delay.  The
line with FP* also checks the trap conditions generated
during the previous floating point operation, and
increments the loop counter.

>>>>>  BOTTOM OF THE LOOP  <<<<<


        R23=CU  FP*D  TRU  RET
        R23=CL;

This line of code does the following:
R23<=F.P. Mult output, return to calling routine
The final version of X is in the multiplier.  This line
puts that result into R23 and unconditionally returns to
the calling routine.  The last line of code in this routine
(below) will be done before the return jump, so it is used
to check for a trap condition occurring on the last multiply.


        TRPS  CALL
        TRAP;


TRAP:   nop;            placeholder for unwritten TRAP routine
end;
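
The arithmetic carried out by the routine above can be mirrored in a few lines of ordinary software. The sketch below is not a translation of the microcode: the function names are hypothetical, the seed is built with a computed 16-entry table standing in for the function ROM (whose actual contents are not given here), and trap checking is omitted. It only shows why a seed good to about four mantissa bits reaches double precision after four passes of X <- X(2 - AX).

```python
import math

def seed_reciprocal(a):
    """Rough first guess at 1/a from the exponent and the top 4 mantissa bits."""
    m, e = math.frexp(abs(a))        # abs(a) = m * 2**e with 0.5 <= m < 1
    mant = 2.0 * m                   # IEEE-style mantissa in [1, 2)
    k = int((mant - 1.0) * 16)       # top 4 mantissa bits, 0..15
    mid = 1.0 + (k + 0.5) / 16.0     # midpoint of that mantissa interval
    # abs(a) = mant * 2**(e-1), so 1/abs(a) is about (1/mid) * 2**(1-e).
    seed = math.ldexp(1.0 / mid, 1 - e)
    return math.copysign(seed, a)

def newton_reciprocal(a, iterations=4):
    """Approximate 1/a (a nonzero and finite) with X <- X * (2 - A*X)."""
    x = seed_reciprocal(a)
    for _ in range(iterations):
        x = x * (2.0 - a * x)
    return x

print(newton_reciprocal(3.0))
```

Each pass roughly squares the relative error, so a seed error near 2**-5 shrinks to about 2**-10, 2**-20, 2**-40 and then below the 53-bit precision of a double, which is consistent with the four iterations used by the routine.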


B.l  Dot  Product  Routine 


DOT PRODUCT MICROCODE SUBROUTINE  --  J CONTOIS

LOCATION OF PASSED PARAMETERS:
              MAR and CPTR: Pointer to C vector
              BINC  : distance between elements of C vector in memory
              APTR  : Pointer to A vector
              AINC  : distance between elements of A vector in memory
              LINC1 : -N, the length of the vectors
              UR22  : pointer to memory location for result (optional)
RETURNS:      result in MBR, memory address of result in MAR
CHANGES:      LINC1, R1, MBR, MAR, APTR, CPTR, FP* and FP+ REGISTERS
NOTES:        R1 holds C vector elements, MBR holds A vector elements
ANALYSIS:     2N+6 clock cycles for vectors of length N
RESTRICTION:  N must be greater than or equal to 1
HISTORY:      Updated to 3.0 revision of the FPASP microcode.
              Additional comments were added by G. Morris, one
              of the EE588 students.


;  At  this  point,  MAR  points  to  the  first  C  vector  element,  APTR  and  CPTR 
;  point  to  the  A  and  C  vectors,  the  A  and  C  incr  registers  have  the 
;  distance  between  successive  elements,  and  LIICl  has  the  vector  size. 

» 

;  This  line  of  code  accomplishes  the  following  actions: 

;  clear  R2S,  load  Rl  <-  C[0],  load  NAR  <*  A[0]  addr,  inc  A  and  C  ptrs 

DOTP:  R2S=CU  GNDCU  APT+  R1=DU  MAR=EU  ETIE  E=APT 

R26=CL  GNDCL  CPT+  R1=DL  NAR=EL  ; 

;  Now  we  have  CCO]  in  Rl,  NAR  points  to  A[0],  R25  contains  zero,  the 
;  A  and  C  pointers  point  to  the  second  vector  elements. 

I 

;  This  line  of  code  accomplishes  the  following  actions: 

; clear FP adder regs, load MBR <- A[0], load MAR <- C[1] addr, inc C ptr

      AU=R25 BU=R25 FP+L MB=DU MAR=EU ETIE
      AL=R25 BL=R25 CPT+ MB=DL MAR=EL E=CPT;

; MBR contains A[0], R1 contains C[0], the FP+ regs contain
; zero. MAR has the address of C[1] and C ptr points to C[2], A ptr to A[1]
;

; This line of code accomplishes the following actions:
; FP add 0's, load FP multiplier regs with A[0] and C[0], load R1 <- C[1],
; load MAR <- A[1] addr, inc A ptr, and increment the loop counter

      AU=MBR BU=R1 FP*L FP+ APT+ R1=DU MAR=EU ETIE E=APT
      AL=MBR BL=R1 LINU R1=DL MAR=EL;

; At the top of this loop, the conditions are as follows:
; the multiplier registers contain the next two elements to be multiplied,
; R25 has the last product, MBR has the next+1 A, and R1 has the next+1 C,
; the FP+ registers contain the partial sum, and the A and C ptrs point to
; the next+2 A and C elements. NOTE: The conditional branch instruction
; which intuitively belongs on the next line of code is physically on this
; line of code to accommodate the pipeline register delay. Even if the branch
; fails, i.e., we fall through, the next instruction WILL be executed.
;

; This line of code accomplishes the following:
; multiply, put R25 (the previous product) and the sum so far (the result of
; the current FP adder registers) back into the adder regs, load MBR <- next A,
; load MAR <- next C address, and inc C ptr

DPLP: AU=R25 BU=FP+ FP* FP+L MB=DU MAR=EU ETIE LIRIH BR
      AL=R25 BL=FP+ CPT+ MB=DL MAR=EL E=CPT DPLP;

; At this point, the multiply has finished, the next C addr is in the
; MAR, MBR and R1 have the new A and C values, and the adder is loaded
; with the partial sum and the last product.
;

; This line of code accomplishes the following:
; Load R25 <- product, load FP* regs from R1 and MBR, add the
; FP+ partial sum, load R1 <- next C value, load MAR <- next A addr,
; inc A ptr, inc the loop counter, if any errors, handle them
; NOTE: There is a branch after this instruction via the previous instr!

      AU=MBR BU=R1 R25=CU FP*LD FP+ APT+ R1=DU MAR=EU ETIE E=APT TRPS CALL
      AL=MBR BL=R1 R25=CL LIll* R1=DL MAR=EL TRAP;

; At this point, the last product is already in R25, and the partial
; sum has been computed and is available at the adder output.
;

; This line of code accomplishes the following:
; Load the last product term and the partial sum into the FP adder regs

      AU=R25 BU=FP+ FP+L
      AL=R25 BL=FP+;

; The adder is loaded with the required operands.
;

;  This  line  of  code  does  the  addition,  and  loads  the  MAR  with  the 
;  destination  address  contained  in  R22.  It  also  does  an  unconditional 
; return, but as before, there is a one cycle delay due to the pipeline.

      AU=R22 PASSU FP+ MAR=CU TRU RET
      PASSL MAR=CL ATIE;

;  At  this  point,  the  FP+  has  finished  adding,  and  the  final  result  is 
;  waiting  at  the  output. 

; This instruction loads MBR with the final dot product
; and then returns to the calling microcode routine (note, the
; ret instruction was actually stated in the previous line of code).

      MBR=CU FP+D
      MBR=CL;

;
; The routine returned to must do the write to memory of the result of DOTP
;

TRAP:  nop  ; placeholder  for  trap  routine 
end; 
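
The staging described in the comments of the listing above can be sketched as a small software model. This is only an illustrative reading of those comments, not the thesis's own simulator: the function names, the flat `memory` list, and the mapping of `last_product` to R25 and `partial_sum` to the FP+ result are assumptions made for this example. Python floats are IEEE double precision, which matches the data type the routine operates on. The sketch shows that the add lags one product behind the multiply, which is why one extra add is needed after the loop exits.

```python
def dot_product_reference(memory, a_ptr, a_inc, c_ptr, c_inc, n):
    """Plain strided dot product over a flat, word-addressed memory list."""
    if n < 1:
        raise ValueError("N must be greater than or equal to 1")
    total = 0.0
    for i in range(n):
        total += memory[a_ptr + i * a_inc] * memory[c_ptr + i * c_inc]
    return total


def dot_product_pipelined(memory, a_ptr, a_inc, c_ptr, c_inc, n):
    """Same sum, staged the way the listing's comments describe (assumed mapping)."""
    if n < 1:
        raise ValueError("N must be greater than or equal to 1")
    last_product = 0.0  # plays the role of R25, cleared on entry
    partial_sum = 0.0   # plays the role of the FP+ result, starts at zero
    for _ in range(n):
        a = memory[a_ptr]  # A element (held in MBR in the listing)
        c = memory[c_ptr]  # C element (held in R1 in the listing)
        a_ptr += a_inc     # advance pointers by the element spacing
        c_ptr += c_inc
        partial_sum = partial_sum + last_product  # add lags one product behind
        last_product = a * c                      # multiply for this element
    return partial_sum + last_product  # final add after the loop exits


def cycle_estimate(n):
    """Cycle count quoted in the routine header: 2N+6 for vectors of length N."""
    return 2 * n + 6
```

For example, with A stored at every second word starting at address 0 and C stored contiguously starting at address 6, `dot_product_pipelined(memory, 0, 2, 6, 1, 3)` walks both vectors with their own increments, in the same way APTR/AINC and CPTR/BINC are used in the microcode.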